reConnect: Removing Duplicates from sorted data-set in one pass

In order to remove duplicates in a sorted data-set, we can employ several algorithms with varying degrees of efficiency. The algorithm that is discussed in this video, involves a single pass of the data-set, and probably one of the efficient techniques with a worst case efficiency of O(N), for a data-set of size N.

This technique involves repeatedly finding and recognizing the next unique value, (in contrast to finding the duplicates) and re-inserting/copying that next unique value in successive positions starting from position 0, overwriting the values in those positions. Finally, when there are no more unique values, we either resize the data-set to end at the last overwritten position or empty the remaining positions after the last overwritten position.

Three separate variables are engaged in tracking the important parameters of this algorithm:

Last Unique Value Index ( tracked using variable “ui”)

The value pointed by “Last Unique Value”, represents the unique value that is found in our search.

Current Index (tracked using variable “ci”)

The value pointed by “Current Index”, at any given moment, represents the current value that is being checked to identify if it is unique or repeated (hence a duplicate).

Next Replaceable Index (tracked using variable “ni”)

The value pointed by “Next Replaceable Index”, at any given moment, represents the position, where the next identified unique value (identified by checking if value at position “last unique value” is different than value at position “Current Index”) is inserted by replacing the value at that position previously.

Initially, an assumption is made that the value at position “Last Unique Value Index” is the value at position 0, and the current index, as well as next replaceable index are pointing to position 1.

With that assumption, we will repeatedly sought out the next unique value, by comparing the value at “Current Index” with value at “Last unique value Index”, by repeating the following steps until “Current Index” reaches the end of our data set.

Step 1:

If value at “Current Index” with value at “Last unique value Index” are found to be different, we replace the value at position “Next Replaceable Index” with the value at position “Current Index”,and update our “Last unique value index” to point to position “Current Index”.

Step 2:

If value at “Current Index” with value at “Last unique value Index” are found to be the same, then we move our “Current Index” to the next position and repeat Step 1.

These steps and the algorithm is better understood by watching the following video about how this technique works: