Solution to Codility lesson3-exercise3 Genomic Range Query problem, link to Codility problem and complete solution at the end of the post.

**Definitions:**

** S: **string of size

*N,*consisting of characters A, C, G and T.

**Query:**Consists in analyzing the substring S[P[K]] to S[Q[K]] with

, and finding the minimal factor, such that A has factor 1, C has factor 2, G has factor 3 and T has factor 4.

** Problem:**Define a function

*solution*with input

*S*,

*P*and

*Q*, such that

*solution*returns an array with the result of each query from

*0*to

*N – 1*.

- Expected worst-case time complexity is
*O(N + M)*; - Expected worst-case space complexity is
*O(N)*, beyond input storage (not counting the storage required for input arguments).

**Elements of input arrays can be modified.**** **

### Analysis:

To solve this problem we are going to use, Prefix Sums. For each factor of string *S *except for T we are going to made a prefix array where S[K] adds *1* to the prefix array of the factor at that position, and *o* to the others, so when doing the query we have only to check if the sum from P[K] to Q[K] of each prefix array is greater than 1.

28 29 30 31 32 33 34 35 36 37 38 39 40 41 |
# Prefix Sum, sume the occurences of each factor in separated prefix sum Arrays def prefix_sums(A): N = len(A) P_3 = [0]*(N + 1) P_2 = [0]*(N + 1) P_1 = [0]*(N + 1) for k in xrange(1, N + 1): val = impact_factor(A[k - 1]) P_3[k] = P_3[k - 1] + (val%4)/3 P_2[k] = P_2[k - 1] + ((val%4)%3)/2 P_1[k] = P_1[k - 1] + ((val%4)%3)%2 return [P_1, P_2, P_3] |

To calculate the prefix sum we have to visit every character of *S, *since *S * has size *N, * our time Complexity until here is *O(N)*

After doing the prefix sums, we only have to perform the query with the precedence order, factor A, factor C, factor G and factor T, and add the result to the result array.

10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
# Check the ocurrencxe of the minimal factor for each factor for k in xrange(M): # Check is there is any occurence of factor A, on the given range if (pref[0][Q[k] + 1] - pref[0][P[k]]) > 0: result[k] = 1 # Check is there is any occurence of factor C, on the given range elif (pref[1][Q[k] + 1] - pref[1][P[k]]) > 0: result[k] = 2 # Check is there is any occurence of factor G, on the given range elif (pref[2][Q[k] + 1] - pref[2][P[k]]) > 0: result[k] = 3 # If none of the above is true then, the query has only T factor else: result[k] = 4 return result |

To perform all queries we have to traverse both arrays *P* and *Q, *at the same time, both have size *M, *which gives us a time complexity of* O(M).*

To facilitate I defined a function that translates the nucleotide to the corresponding factor.

43 44 45 46 47 48 49 50 51 52 53 54 |
# Translate each nucleotide for it's factor def impact_factor(nucleotide): if nucleotide == "A": return 1 elif nucleotide == "C": return 2 elif nucleotide == "G": return 3 elif nucleotide == "T": return 4 return None |

The total time complexity is *O(N) + O(M) = O(N + M)*.

For the space complexity we have *O(N) *since we have 3 prefix arrays, depending on the size of the string *S*.

**Complete Solution:**

1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556
# Name: GenomicRangeQuery# Link: https://codility.com/demo/take-sample-test/genomic_range_query/ def solution(S, P, Q): pref = prefix_sums(S) M = len(P) result = [0]*M # Check the ocurrencxe of the minimal factor for each factor for k in xrange(M): # Check is there is any occurence of factor A, on the given range if (pref[0][Q[k] + 1] - pref[0][P[k]]) > 0: result[k] = 1 # Check is there is any occurence of factor C, on the given range elif (pref[1][Q[k] + 1] - pref[1][P[k]]) > 0: result[k] = 2 # Check is there is any occurence of factor G, on the given range elif (pref[2][Q[k] + 1] - pref[2][P[k]]) > 0: result[k] = 3 # If none of the above is true then, the query has only T factor else: result[k] = 4 return result # Prefix Sum, sume the occurences of each factor in separated prefix sum Arraysdef prefix_sums(A): N = len(A) P_3 = [0]*(N + 1) P_2 = [0]*(N + 1) P_1 = [0]*(N + 1) for k in xrange(1, N + 1): val = impact_factor(A[k - 1]) P_3[k] = P_3[k - 1] + (val%4)/3 P_2[k] = P_2[k - 1] + ((val%4)%3)/2 P_1[k] = P_1[k - 1] + ((val%4)%3)%2 return [P_1, P_2, P_3] # Translate each nucleotide for it's factordef impact_factor(nucleotide): if nucleotide == "A": return 1 elif nucleotide == "C": return 2 elif nucleotide == "G": return 3 elif nucleotide == "T": return 4 return None

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 |
# Name: GenomicRangeQuery # Link: https://codility.com/demo/take-sample-test/genomic_range_query/ def solution(S, P, Q): pref = prefix_sums(S) M = len(P) result = [0]*M # Check the ocurrencxe of the minimal factor for each factor for k in xrange(M): # Check is there is any occurence of factor A, on the given range if (pref[0][Q[k] + 1] - pref[0][P[k]]) > 0: result[k] = 1 # Check is there is any occurence of factor C, on the given range elif (pref[1][Q[k] + 1] - pref[1][P[k]]) > 0: result[k] = 2 # Check is there is any occurence of factor G, on the given range elif (pref[2][Q[k] + 1] - pref[2][P[k]]) > 0: result[k] = 3 # If none of the above is true then, the query has only T factor else: result[k] = 4 return result # Prefix Sum, sume the occurences of each factor in separated prefix sum Arrays def prefix_sums(A): N = len(A) P_3 = [0]*(N + 1) P_2 = [0]*(N + 1) P_1 = [0]*(N + 1) for k in xrange(1, N + 1): val = impact_factor(A[k - 1]) P_3[k] = P_3[k - 1] + (val%4)/3 P_2[k] = P_2[k - 1] + ((val%4)%3)/2 P_1[k] = P_1[k - 1] + ((val%4)%3)%2 return [P_1, P_2, P_3] # Translate each nucleotide for it's factor def impact_factor(nucleotide): if nucleotide == "A": return 1 elif nucleotide == "C": return 2 elif nucleotide == "G": return 3 elif nucleotide == "T": return 4 return None |

*Time Complexity O(N + M)*

*Space Complexity O(N)*