# Codility - Genomic Range Query

Solution to Codility lesson3-exercise3 Genomic Range Query problem, link to Codility problem and complete solution at the end of the post.

Definitions:
$\mathbf{P} = \{x \in \mathbb{N}\ |0\leq x \leq N\land |P| = M\},\mathbf{P}\, is\, a\, multiset.$
$\mathbf{Q} = \{x \in \mathbb{N}\ |0\leq x \leq N\land |Q| = M\},\mathbf{Q}\, is\, a\, multiset.$
$\{N \in \mathbb{N}\ |\ 1 \leq N \leq 100,000\}$
$\{M \in \mathbb{N}\ |\ 1 \leq M\leq 50,000\}$
$p_0\leq q_0\, such\, that\, p_0 \in \mathbf{P}=\{p_0,p_1,p_2,...\}\land q_0 \in \mathbf{Q}=\{q_0,q_1,q_2,...\}$
S: string of size N, consisting of characters A, C, G and T.
Query: Consists in analyzing the substring S[P[K]] to S[Q[K]] with $0\leq K, and finding the minimal factor, such that A has factor 1, C has factor 2, G has factor 3 and T has factor 4.

Problem:
Define a function solution with input SP and Q, such that solution returns an array with the result of each query from 0 to N - 1.

• Expected worst-case time complexity is O(N + M);
• Expected worst-case space complexity is O(N), beyond input storage (not counting the storage required for input arguments).

Elements of input arrays can be modified.

### Analysis:

To solve this problem we are going to use, Prefix Sums. For each factor of string S except for T we are going to made a prefix array where S[K] adds 1 to the prefix array of the factor at that position, and o to the others, so when doing the query we have only to check if the sum from P[K] to Q[K] of each prefix array is greater than 1.

To calculate the prefix sum we have to visit every character of S, since  has size N,  our time Complexity until here is O(N)
After doing the prefix sums, we only have to perform the query with the precedence order, factor A, factor C, factor G and factor T, and add the result to the result array.

To perform all queries we have to traverse both arrays P and Q, at the same time, both have size M, which gives us a time complexity of O(M).
To facilitate I defined a function that translates the nucleotide to the corresponding factor.

The total time complexity is O(N) + O(M) = O(N + M).
For the space complexity we have O(N) since we have 3 prefix arrays, depending on the size of the string S.

### Complete Solution: Genomic Range Query Python # Name: GenomicRangeQuery # Link: https://codility.com/demo/take-sample-test/genomic_range_query/ def solution(S, P, Q): pref = prefix_sums(S) M = len(P) result = [0]*M # Check the ocurrencxe of the minimal factor for each factor for k in xrange(M): # Check is there is any occurence of factor A, on the given range if (pref[0][Q[k] + 1] - pref[0][P[k]]) > 0: result[k] = 1 # Check is there is any occurence of factor C, on the given range elif (pref[1][Q[k] + 1] - pref[1][P[k]]) > 0: result[k] = 2 # Check is there is any occurence of factor G, on the given range elif (pref[2][Q[k] + 1] - pref[2][P[k]]) > 0: result[k] = 3 # If none of the above is true then, the query has only T factor else: result[k] = 4 return result # Prefix Sum, sume the occurences of each factor in separated prefix sum Arrays def prefix_sums(A): N = len(A) P_3 = [0]*(N + 1) P_2 = [0]*(N + 1) P_1 = [0]*(N + 1) for k in xrange(1, N + 1): val = impact_factor(A[k - 1]) P_3[k] = P_3[k - 1] + (val%4)/3 P_2[k] = P_2[k - 1] + ((val%4)%3)/2 P_1[k] = P_1[k - 1] + ((val%4)%3)%2 return [P_1, P_2, P_3] # Translate each nucleotide for it's factor def impact_factor(nucleotide): if nucleotide == "A": return 1 elif nucleotide == "C": return 2 elif nucleotide == "G": return 3 elif nucleotide == "T": return 4 return None 123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354 # Name: GenomicRangeQuery# Link: https://codility.com/demo/take-sample-test/genomic_range_query/  def solution(S, P, Q):    pref = prefix_sums(S)    M = len(P)    result = [0]*M     # Check the ocurrencxe of the minimal factor for each factor    for k in xrange(M):        # Check is there is any occurence of factor A, on the given range        if (pref[0][Q[k] + 1] - pref[0][P[k]]) > 0:            result[k] = 1        # Check is there is any occurence of factor C, on the given range        elif (pref[1][Q[k] + 1] - pref[1][P[k]]) > 0:            result[k] = 2        # Check is there is any occurence of factor G, on the given range        elif (pref[2][Q[k] + 1] - pref[2][P[k]]) > 0:            result[k] = 3        # If none of the above is true then, the query has only T factor        else:            result[k] = 4     return result  # Prefix Sum, sume the occurences of each factor in separated prefix sum Arraysdef prefix_sums(A):    N = len(A)    P_3 = [0]*(N + 1)    P_2 = [0]*(N + 1)    P_1 = [0]*(N + 1)     for k in xrange(1, N + 1):        val = impact_factor(A[k - 1])        P_3[k] = P_3[k - 1] + (val%4)/3        P_2[k] = P_2[k - 1] + ((val%4)%3)/2        P_1[k] = P_1[k - 1] + ((val%4)%3)%2     return [P_1, P_2, P_3] # Translate each nucleotide for it's factordef impact_factor(nucleotide):    if nucleotide == "A":        return 1    elif nucleotide == "C":        return 2    elif nucleotide == "G":        return 3    elif nucleotide == "T":        return 4     return None

Time Complexity O(N + M)
Space Complexity O(N)