The algorithm and data structure described here was a result of my attempt at solving a different problem. Though I still don't know if the data structure described here is applicable to my original problem, I describe it here in the hope that it may be applicable to some problems.

The prerequisites

The data structure falls into the family of persistent data structures. Once a node in the data structure is written it is never modified, in other words nodes are immutable. Each new node will point to the previous node. In order for binary search to make sense, the nodes must be added to the list in order according to the comparison function that will subsequently be used for binary searches. If those criteria are satisfied the algorithm can be used.

Complexity

The algorithm given here will allow adding a node to the linked list with work O(log n). This is more expensive than the O(1) complexity of adding the element to a structure that does not support binary search. However the space complexity remains O(1) for adding a node. Searching in the data structure takes time O(log n).

Overview

The idea is that each node in addition to having a single point to the previous node have an additional pointer that skips an exponentially increasing number of nodes such that it is possible to move fast through the list.

Since I want the space complexity to remain constant, I have only one such extra pointer per node. This means it will be necessary to move through a few nodes to find a pointer skipping the desired distance through the list. Consequently the search differs slightly from a conventional binary search in that the actual elements being inspected may be offset up to log(n) positions from where you would usually have looked. This doesn't change the search complexity however.

There need to be pointers skipping powers of 2 steps through the list. That means skip 2, 4, 8, etc. steps. Since each node contains only one such pointer, the structure cycles through the different step lengths. This is achieved by grouping the nodes into groups of log(n) nodes each which skip a number of groups. It will turn out to be convenient to make the skips be a number of groups rather than a number of nodes, in turn it makes sense to start with skipping a single group instead of 2, thus each group will skip 1, 2, 4, etc. groups backward proceeding as far as there are earlier groups to point to. Whenever possible the skip pointer will point to the same position in the destination group as the node it was pointing from.

The following table shows what the first 15 nodes inserted in the structure will contain. The group number is not explicitly represented. It is only shown for clarity. Notice that as the previous pointer is followed from the newest element backwards through the list towards the oldest element, the skip pointer will be increasing except when a group boundary is crossed in which case it will rather be decreasing. That way group boundaries can be recognized without being explicitly represented.

NodePrev pointerSkip pointerGroup
1NULLNULLA
211B
322C
431
543D
652
765E
874
981
1097F
11106
12112
131210G
14138
15144

Exceptions

As can be seen from the table above there is a few exceptions from the general structure of the groups described.

First of all group A should have had no elements, since there are no previous groups to be pointing at. However had group A been empty, there would have been nowhere to point the skip pointers from later groups. Having group A be an exception is desirable compared to the alternative of having an exception for every node that would have been pointing to group A. Thus group A has size 1 and has a pointer skipping one group backwards and thus is a NULL pointer.

Also notice that the principle of having every skip pointer point to the same position in the group it points to has to be broken when the group size grows. For example node 6 which is in group D would point 2 groups backward should have pointed to the same position in group B. But since group B is smaller there is no appropriate location to point to, instead it points to the last element in group B, which is as close as it gets.

Searching

The first stage of searching for an element is the search for a node with a skip pointer skipping as many steps as possible. This is done by following the previous pointer one node at a time until a group boundary is crossed. Of course during this stage values must be compared such that the node will be found even if it is in the newest group, which may not be complete yet. Since each group is of logarithmic size, this takes time O(log n).

Next comes the skipping face. Look at the node pointed to by the skip node. Follow the skip pointer if it does not go too far back. If the skip pointer goes too far back, instead follow the previous pointer to find a node, which only skips half the distance. Notice that this is the reason the skip distances are ordered the way they are. Had the skip distances within each group been in the opposite order, finding a node with half the skip distance would have required a logarithmic number of steps instead of just one, and the overall complexity would have grown from O(log n) to O((log n)^2).

Notice that even following the previous pointer may go too far back. When this happens it means the search is completed and either the desired node was found, or two neighbours have been found where the value being search for should have been between.

The skipping face iterates following either the skip pointer or the previous pointer until the value is found or it is concluded it does not exist.

Complexity of searching

The overall complexity will be logarithmic because the previous pointer can be followed at most 3 log n times, and the skip pointer can be followed at most log n times.

The search algorithm can only use a certain skip length once. If a certain skip length could have been used twice, then instead the algorithm would have used twice the skip length just once.

The search algorithm can only follow the previous pointer 3 log n times. The first stage runs through at most log n nodes in the newest group searching for a group boundary. The next stage will only follow the previous pointer when the skip length need to be reduced. The skip length can only be reduced log n times. Finally once even skipping a single group is too far, the algorithm will instead follow the previous pointer for the rest of the search. But there can only be at most log n steps left since that is the length of the single group skip, which was known to be too far.

Insertion

The only nontrivial aspect of adding a node is where to point the skip pointer. I left this until this point because it needs to make use of the search algorithm described above.

The general case of performing an insertion that is not at the beginning of a group is fairly simple. The new node must skip twice the distance of the previous node. But in that case it suffices to simply follow the skip pointer twice starting from the newest node. Doing this will find the desired group. However it will skip one node too far. The new node must point to an older node with the same skip distance, but it found one with half the skip distance. Since there are no pointers to newer nodes, it may seem what was found so far is in vain. However it is not, because knowing a neighbour to the desired node is a usable criteria to use for applying the search algorithm above. By applying the search algorithm the next newer node can be found in logarithmic time.

The case where the current group is larger than the group being pointed to requires separate handling, which is fairly simple. The search found the next newer node. It may be that that next newer node is actually not in the same group, but rather in the next group. When that happens simply use the node that was found in the first case instead, that is the one found by following the skip pointer twice.

The case where a new group needs to be started needs special treatment. It is not difficult to find the target to use for the skip pointer in this case. Simply follow the previous pointer back to the start of the current group, and that is the target for the new skip pointer. However knowing that this is required is tricky. A new group is needed when it is not possible to skip twice the number of groups compared to the previous node. What happened when trying to skip the desired distance is that the first skip pointer skipped as far as expected, but the second skip pointer skipped a shorter distance. To detect this it is necessary to search back from each of the two nodes whose skip pointers where used. Counting the number of nodes between a specific node and the start of the group will reveal if two nodes have the same skip distance.

The case of the skip distance halving without us knowing was not a problem in the case of the searching algorithm, because if the searching algorithm had known it was not possible to skip the full distance, it would have skipped half the distance anyway.

Posted 2012 Apr 20th.

Fremhævet produkt

Free v4 frontend for your v6 site



Read more...