Find median of two sorted arrays in logarithmic time is quite famous interview question (and a Leetcode problem from hard category). In this article I’ll generalize this question and solve it with Python. Let’s code!
Median in two sorted arrays
First, let’s define first problem:
Given two integer arrays (sorted in ascending order) find median value of merged array.
Reminder: median is central element in odd case and average of central elements in even case. This question is not a big deal, when you’re not limited by computational complexity. For example, two sorted arrays can be merged in O(n) and then median value can be returned in O(1).
Because of that, classic version of the problem includes condition that it should be solved in logarithmic time and that makes things more interesting.
K order static in two sorted arrays
Now, let’s state a little more general question: to find k-th element in two sorted arrays.
Given two integer arrays (sorted in ascending order) find k order value of merged array.
This looks kind of the same with one subtle difference, that k order statistic is always element of array. So, arrays are sorted and we want to find some special element in logarithmic time. This gives us a hint that probably some binary search variation should be involved.
Indeed, we can search for element x[j], such as it is larger than f(j) = k - j elements in y. If such element exists it’ll be exactly k order statistic in sorted array. Othewise such element should exist in y, if 0 <= k < n1 + n2.
Here’s the code (32 lines):
def k_order_two_sorted_binsearch(x, y, n, n1, n2, h):
l, r, prev_state, state = 0, n1-1, (0, n1-1), (-1, -1)
while prev_state != state: # is there (l, r) update?
m = (l + r)//2
pivot, f_m = x[ m ], h-m
if f_m > n2: # check boarder cases for f_m values
l = m
elif f_m == n2:
if x[ m ] >= y[ n2 - 1 ]:
return x[ m ]
else:
l = m if l != r-1 else r
elif f_m < 0:
r = m
elif f_m == 0:
if x[ m ] <= y[ 0 ]:
return x[ m ]
else:
r = m
else: # main part
if x[ m ] >= y[ f_m - 1 ] and x[ m ] <= y[ f_m ]:
return x[ m ]
elif x[ m ] < y[ f_m - 1 ]:
l = m if l != r-1 else r # don't repeat this at home
elif x[ m ] > y[ f_m ]:
r = m
prev_state, state = state, (l, r)
def find_k_order_two_sorted(x, y, n1, n2, k):
if r := k_order_two_sorted_binsearch(x, y, n1 + n2, n1, n2, k ):
return r
return k_order_two_sorted_binsearch(y, x, n1 + n2, n2, n1, k )
Before we move to code explanation, I would like to look at this problem from monotonic sequence search view.
Let c[j] be number of elements in y less than x[j], finally let’s look at f[j]-c[j].
Sequence f is monotonic decreasing (number of elements in y that we need, just k-j), c is non-decreasing, so their difference is also monotonic decreasing.
And by definition zero element in difference sequence corresponds to k order statistic.
Unfortunately, we can’t compute f[j]-c[j] in O(1) but luckily we don’t need it. Instead we can compute predicate sign(f[j]-c[j]) in constant time, as it was shown in code. This last observation gives us O(log(n1)) complexity for one k_order_two_sorted_binsearch(x, y, …) call. With symmetrical k_order_two_sorted_binsearch(y, x, …) call that gives us O(log(n1) + log(n2)) complexity.
By the way, f[j]-c[j] can be computed in O(log(n2)) and that would give us O(log(n1) log(n2)), logarithmic-squared complexity, which is not logarithmic enough.
Let’s get back into code. Half of the function k_order_two_sorted_binsearch() is dancing around edge cases for f(m) value. It can be reduced, however I think that code compressing here is not useful for clarity.
This funny line
l = m if l != r-1 else r
is a way to avoid cycle on the last iterations:
(l, l+1) -> (l, l+1) -> (l+1, l+1)
instead of
(l, l+1) -> (l, l+1) -> (l, l+1) -> ...
Let’s move back to median. Observation: median is linear function of two k-order statistics: k = floor((n1+n2-1)/2) and k+1. Now the solution of median problem will look like:
def median_two_sorted_log(x, y):
n1, n2, n = len(x), len(y), len(x) + len(y)
lm = find_k_order_two_sorted(x, y, n1, n2, (n - 1)//2)
if (n1 + n2) % 2 == 1:
return lm
else:
return (lm + find_k_order_two_sorted(x, y, n1, n2, n//2)) / 2
And finally, four logarithmic calls in worst case (even size of merged array and both misses in find_k_order_two_sorted()) isn’t very satisfying result for median problem. If you want more competitive solution you can easily modify k_order_two_sorted_binsearch() in a way it will return k-th and k+1-th order statistics in one function call.
It’s because when you found x[m] which you are going to return from function the next k+1-th element in merged array would be either x[m+1] or y[f_m].
This way the number of logarithmic calls will reduce by factor of two. And such modification ensure complexity O(log(n1) + log(n2)). Modified solution look like this:
def median_two_sorted_log_modified(x, y):
n1, n2, n = len(x), len(y), len(x) + len(y)
lm, rm = find_k_order_two_sorted_modified(x, y, n1, n2, (n - 1)//2)
if (n1 + n2) % 2 == 1:
return lm
else:
return (lm + rm) / 2
Code of find_k_order_two_sorted_modified() is very similar to original version. Try to fix k_order_two_sorted_binsearch() by your own or check out my modified version of code.
How good is logarithmic complexity solution?
Let’s do some fair comparison for the median version of the problem. This is the baseline code:
def merge_two_sorted(x, y):
l, m, result = 0, 0, []
while l != len(x) and m != len(y):
if x[l] < y[m]:
result.append(x[l])
l += 1
else:
result.append(y[m])
m += 1
while l != len(x):
result.append(x[l])
l += 1
while m != len(y):
result.append(y[m])
m += 1
return result
def median_two_sorted_linear(x, y):
merged = merge_two_sorted(x, y)
n = len(merged)
if n % 2 == 1:
return merged[(n-1)//2]
else:
return (merged[(n-1)//2] + merged[n//2])/2
It computes the median value for two sorted lists in linear time.
By the way, can you modify this code to O(k)?
Let’s do some tests for random arrays of sizes 30, 50, 100, 150, 200.
Here’s results showing boost in performance only growing with array size.
Instead of conclusion
In this small article we investigated nice solution of MedianTwoSorted problem. Moreover generalized version of the problem (with k order statistic instead of median) has been solved by original algorithm. You can find full code here.