Description
The DNA sequence is composed of a series of nucleotides abbreviated as 'A'
, 'C'
, 'G'
, and 'T'
.
- For example,
"ACGAATTCCG"
is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s
that represents a DNA sequence, return all the 10
-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = “AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT” Output: [“AAAAACCCCC”,“CCCCCAAAAA”]
Example 2:
Input: s = “AAAAAAAAAAAAA” Output: [“AAAAAAAAAA”]
Constraints:
1 <= s.length <= 105
s[i]
is either'A'
,'C'
,'G'
, or'T'
.
Code
hashmap
最直白的解法,不過在 space complexity 上有優化的空間。
hashmap + bit
hashmap 從 unorder_map<string, int>
變成了 unordered_map<int, int>
,進一步節省空間。
參考 ASCII,A, T, C, G 的 binary 分別為:
A: 0100 0001 T: 0100 0011 C: 0101 0100 G: 0100 0111
可看出最多只需要用最後三個 bit 就可以完全區分四者,因此 s[i] & 7
。
使用 rolling hash,因為 s[i] & 7
是取最後三 bit,所以 left shift 3。& 0x3FFFFFFF
是因為十個字母每個 3 bit 總共 30 bit。
要注意 s.substr(i - 9, 10)
。