Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Just use an array as hashtable to avoid Memory Limit Exceed.
The total number of combinations of “ACGT” string in the length of 10 is 4^10 which is 2^20. We allocate a sort data type to each combination storing the number of occurrence of it.
The memory use is 1MB.
O(n) in time, O(1) in space.
Warning, when we use “<<” operator, remember to use parentheses to ensure the result is exactly what you want.
//A C G T
//0 1 2 3
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
if(s.size() <= 10) return vector<string>();
short map[0x00100000];
memset(map, 0, sizeof(map));
vector<string> ans;
unordered_map<char, int> idxTable;
idxTable['A'] = 0;
idxTable['C'] = 1;
idxTable['G'] = 2;
idxTable['T'] = 3;
for(size_t i = 0; i < s.size() - 9; i++){
string subs = s.substr(i, 10);
int idx = 0;
for(size_t j = 0; j < subs.size(); j++){
idx = (idx << 2) + idxTable[subs[j]];//BUG HERE, the priority of operators.
}
map[idx]++;
if(map[idx] > 3) map[idx] = 3; //avoid overflow
if(map[idx] == 2){
ans.push_back(subs);
}
}
return ans;
}
};

O(n)space,force solution, Memory Limit Exceed.
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<string, int> map;
if(s.size() <= 10) return vector<string>();
for(size_t i = 0; i < s.size() - 9; i++){
string subs = s.substr(i, 10);
if(map.find(subs) != map.end()){
map[subs] += 1;
}
else{
map[subs] = 1;
}
}
vector<string> ans;
for(auto it = map.begin(); it != map.end(); it++){
if((*it).second > 1){
ans.push_back((*it).first);
}
}
return ans;
}
};