Methods for Sequence Analysis

Sequence types

Two kinds of sequences can be analyzed: state sequences and event sequences. A state sequence describes the succession of states an entity passes through. In a social science setting, such a sequence could record the marital status of an individual, such as single, married or divorced. In a disease treatment study, the states could be pre-diagnosed, diagnosed, hospitalized, in rehab, in remission or cured. A state sequence therefore consists of stretches of time during which an entity is in a particular state. An event sequence, in contrast, associates each event with a specific time point. In a business context, an event sequence could consist of news events such as the appointment of a new CEO, the release of financial figures, the announcement of a new marketing campaign or a product launch. A state sequence also gives rise to an event sequence if each transition from one state to another is treated as an event.
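The relationship between the two sequence types can be made concrete with a small Python sketch. It is only an illustration: the function names state_spells and transition_events, the 1-based time index and the "from>to" event labels are assumptions made here, not part of any particular package.

```python
from itertools import groupby


def state_spells(states):
    """Collapse a state sequence into (state, start_time, length) spells.

    Time points are numbered from 1 (an assumption of this sketch).
    """
    spells = []
    t = 1
    for state, run in groupby(states):
        length = len(list(run))
        spells.append((state, t, length))
        t += length
    return spells


def transition_events(states):
    """Return one (time, "from>to") event for each change of state.

    The event is dated at the first time point spent in the new state.
    """
    events = []
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            events.append((i + 1, f"{states[i - 1]}>{states[i]}"))
    return events


# Hypothetical marital-status sequence observed at six time points.
marital = ["single", "single", "married", "married", "married", "divorced"]
print(state_spells(marital))
print(transition_events(marital))
```

The spell representation keeps the durations, while the event representation keeps only the time points at which something changes.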

Illustration of data structure

Consider a distance learning institution where students complete a prescribed number of courses at their own pace. At regular intervals, say monthly, the institution records the number of courses each student has passed. Suppose also that students must finish all the courses within a specified length of time. The state of a student is the number of courses passed so far. The state sequence for a student therefore starts with a run of zeros that lasts until the first course is passed. Then come 1’s, followed by 2’s and so forth. To be specific, let N be the number of time points within which students must complete the courses, which is also the sequence length. Let U be the number of courses to pass, so that the number passed at any time point ranges from 0 to U. Suppose U = 5. We then label the states F, E, D, C, B and A, corresponding to having passed 0, 1, ..., 5 courses, respectively. Suppose there are M students, and hence M sequences.

I generated such sequences by randomly picking U = 5 values from the time points 1, ..., N, where N = 20, and sorting them. The sorted values are the time points at which a student’s state moves up by one level. The first few sequences are:
##   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
## 1  F  E  E  D  D  D  D  D  D   D   D   D   C   B   B   B   A   A   A   A
## 2  F  F  E  D  D  D  C  C  C   B   A   A   A   A   A   A   A   A   A   A
## 3  F  E  E  D  D  D  D  D  D   D   D   D   D   C   C   B   B   B   B   A
## 4  F  E  E  E  E  E  E  D  D   C   C   C   C   C   C   B   B   A   A   A
## 5  F  E  E  E  E  D  D  D  D   C   C   C   B   B   B   B   B   B   B   A
## 6  F  F  F  F  F  E  E  E  E   E   E   D   C   C   C   C   C   B   B   A
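
The generation procedure described above can be sketched in Python as follows. This is not the code that produced the sequences shown; it is a minimal illustration under two assumptions: the U change points are distinct (sampled without replacement), and the state at time t is the number of change points at or before t. The names simulate_sequence and LABELS and the seed are choices made for this sketch.

```python
import random

N = 20             # sequence length (number of observation times)
U = 5              # number of courses to pass
LABELS = "FEDCBA"  # LABELS[k] is the state after passing k courses


def simulate_sequence(rng):
    """Simulate one student's state sequence of length N."""
    # Assumption: change points are distinct, so the state never
    # jumps by more than one level between consecutive time points.
    change_points = sorted(rng.sample(range(1, N + 1), U))
    # State at time t = number of change points at or before t.
    return [LABELS[sum(c <= t for c in change_points)] for t in range(1, N + 1)]


rng = random.Random(1)  # fixed seed, chosen arbitrarily for reproducibility
M = 6                   # number of students (sequences) to simulate
sequences = [simulate_sequence(rng) for _ in range(M)]
for i, seq in enumerate(sequences, start=1):
    print(i, " ".join(seq))
```

Because all U change points fall within 1, ..., N, every simulated sequence ends in state A.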

Visualization

Several kinds of plots can be produced to gain a visual impression of the data. Figure 1 shows three plots. The first two are index plots: one displays only the first 10 sequences, and the other displays all the sequences sorted by some criterion, here the average completion time of the courses in a sequence. The third is the state distribution plot, which gives the frequency of each state across the sequences at each time point. Figure 2 shows the frequencies of the modal states, the mean time spent in each state and the transversal entropies.

------ Fig 1 & 2 here -------
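
The quantities behind these plots are simple summaries of the sequence table. The Python sketch below computes them for the six sequences listed earlier; it is an illustration, not the code used for the figures. The entropy is taken here as the Shannon entropy with natural logarithms and without normalization, which is an assumption; dividing by the logarithm of the number of states would rescale it to the range 0 to 1.

```python
import math
from collections import Counter

STATES = "FEDCBA"
rows = [
    "F E E D D D D D D D D D C B B B A A A A",
    "F F E D D D C C C B A A A A A A A A A A",
    "F E E D D D D D D D D D D C C B B B B A",
    "F E E E E E E D D C C C C C C B B A A A",
    "F E E E E D D D D C C C B B B B B B B A",
    "F F F F F E E E E E E D C C C C C B B A",
]
sequences = [r.split() for r in rows]
n_seq, n_time = len(sequences), len(sequences[0])

# State distribution: share of sequences in each state at each time point.
distribution = []
for t in range(n_time):
    counts = Counter(seq[t] for seq in sequences)
    distribution.append({s: counts[s] / n_seq for s in STATES})

# Modal state at each time point, with its relative frequency.
# Ties are broken in favor of the state listed first in STATES.
modal = [max(d.items(), key=lambda kv: kv[1]) for d in distribution]

# Mean time in each state: average number of time points per sequence.
mean_time = {s: sum(seq.count(s) for seq in sequences) / n_seq for s in STATES}


def entropy(d):
    """Shannon entropy (natural log) of one cross-sectional distribution."""
    return -sum(p * math.log(p) for p in d.values() if p > 0)


# Transversal entropy: one entropy value per time point.
entropies = [entropy(d) for d in distribution]

for t in range(n_time):
    state, freq = modal[t]
    print(f"t={t + 1:2d}  modal={state} ({freq:.2f})  entropy={entropies[t]:.3f}")
print(mean_time)
```

The entropy is zero at time points where all sequences are in the same state and grows as the states become more evenly spread.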

Analyzing sequences with dissimilarities

There are several ways of measuring the distance, or dissimilarity, between sequences. The analyses here use a distance matrix computed by optimal matching of pairs of sequences. A portion of such a distance matrix is shown below.

------ Fig 3 here -------
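
Optimal matching measures the distance between two sequences as the minimal total cost of the insertions, deletions and substitutions needed to turn one into the other. The Python sketch below shows the standard dynamic programming computation. The cost values (insertion or deletion cost 1, constant substitution cost 2) and the function names are assumptions chosen for illustration; the costs behind the matrix in the figure are not stated in the text.

```python
def om_distance(a, b, indel=1.0, sub=2.0):
    """Optimal matching distance between sequences a and b.

    indel: cost of inserting or deleting one state (assumed value).
    sub:   cost of substituting one state for a different one (assumed value).
    """
    # prev[j] is the cost of turning the current prefix of a into b[:j].
    prev = [j * indel for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * indel]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel,                         # delete x
                curr[j - 1] + indel,                     # insert y
                prev[j - 1] + (0.0 if x == y else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def distance_matrix(seqs, **costs):
    """Symmetric matrix of pairwise optimal matching distances."""
    n = len(seqs)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = om_distance(seqs[i], seqs[j], **costs)
    return d


# The first three sequences listed earlier in the text.
seqs = [
    "F E E D D D D D D D D D C B B B A A A A".split(),
    "F F E D D D C C C B A A A A A A A A A A".split(),
    "F E E D D D D D D D D D D C C B B B B A".split(),
]
for row in distance_matrix(seqs):
    print(row)
```

With these costs a substitution never costs more than a deletion followed by an insertion, so both kinds of operation can appear in the cheapest alignment.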

Sequences of news: Specifics of studying corporate action sequences