Your code looks dense because you've stuffed a lot of stuff into a single function, without any structuring whitespace. But that is merely cosmetic and can be fixed. For example, you might structure the code with some blank lines …
fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
return Helper {
s1,
s2,
cache: HashMap::new(),
}
.helper(0, 0);
struct Helper { ... }
impl Helper { ... }
}
… or move the helper outside entirely:
fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
Helper {
s1,
s2,
cache: HashMap::new(),
}
.helper(0, 0)
}
struct Helper { ... }
impl Helper { ... }
I've thought about whether a separate Helper
struct even makes sense. In principle, you could just pass the values like the cache as arguments to a helper function, but given the pair of vecs s1, s2
that should be kept together a struct makes sense. Often, the alternative would be to simply use a closure, but we can't directly have recursive closures.
A different debate can be had about whether the helper()
function should be a method. I'd argue “no” – this is a purely syntactic choice here, and we could instead write the helper function to take a &mut Helper
parameter. But that is primarily opinion-based.
When I look into the helper() implementation, I note this pattern:
if self.cache.contains_key(&(i, j)) {
return self.cache[&(i, j)].clone();
}
This unnecessarily makes two accesses to the cache. Instead, we can write:
if let Some(value) = self.cache.get(&(i, j)) {
return value.clone();
}
The get()
method returns an Option<&_>
, which we can pattern-match using if let
.
Next, we have the pattern
if let Some(value) = self.cache.get(&(i, j)) {
return value.clone();
}
let value = ...;
self.cache.insert(&(i, j), value.clone());
value
Again, the get()
and insert()
accesses do duplicate work. Typically, we can avoid this using the HashMap Entry API. An entry gives access to a slot in the hash map which may be occupied or may be vacant, and we can efficiently insert a value. The inserting function will return a reference to the value in either case, which we can clone:
self.cache.entry((i, j)).or_insert_with(|| ... ).clone()
which is very elegant except that this doesn't work here: the entry() keeps a &mut
reference to the self
Helper so that it can insert an item, but the closure also wants to borrow self
. For that reason, it makes sense stick with your general design – but note that this could be avoided by using a different type for the cache, e.g. a Vec instead of a HashMap. A Vec or slice can be split_at()
so that different sub-ranges can be borrowed separately. The problem as a whole is also amenable to using Dynamic Programming instead of memoization: without recursion, we wouldn't have to hold on to the cache for the duration of the recursive calls.
This is a common theme with Rust: things seem like they should be easy, but they're not that easy when thinking about the actual ownership or safety issues.
For example, I was thinking about returning a &[u8]
from the helper() function to avoid unnecessary copies. This mostly works and saves a wasted allocation in the tmp.append(...)
statement, but will ultimately have little effect since the max_by_key(...)
statement has two calls into helper(). If their return value were to borrow the Helper struct, they would conflict with each other. In this particular case, it's also not possible to break the &mut
requirement unless we put the cache into a RefCell
and guarantee that the vec contents won't be moved, which we could do with a Pin
and a bit of unsafe
code.
In order to avoid excessive Vec copies, an alternative to references would be to track indexes into an array. This is a common strategy in Rust to avoid lifetime issues. Instead of
struct Helper {
...
cache: HashMap<(usize, usize), Vec<u8>>,
}
we might use:
struct Helper {
...
cache: HashMap<(usize, usize), usize>,
subseqs: Vec<Vec<u8>>,
}
Then:
fn longest_common_subsequence(s1: Vec<u8>, s2: Vec<u8>) -> Vec<u8> {
let mut helper = Helper {
s1,
s2,
cache: HashMap::new(),
subseqs: vec![Vec::new()],
};
let result_index = helper.cached_value_for(0, 0);
helper.subseqs[result_index].clone()
}
impl Helper {
fn cached_value_for(&mut self, i: usize, j: usize) -> usize {
if let Some(&value) = self.cache.get(&(i, j)) {
return value;
}
let value = self.computed_value_for(i, j);
self.cache.insert((i, j), value);
value
}
fn computed_value_for(&mut self, i: usize, j: usize) -> usize {
if i >= self.s1.len() || j >= self.s2.len() {
return 0;
}
if self.s1[i] == self.s2[j] {
let next_index = self.cached_value_for(i + 1, j + 1);
let mut tmp = vec![self.s1[i]];
tmp.extend_from_slice(&self.subseqs[next_index]);
self.subseqs.push(tmp);
return self.subseqs.len() - 1;
}
max_by_key(
self.cached_value_for(i + 1, j),
self.cached_value_for(i, j + 1),
|&s| self.subseqs[s].len(),
)
}
}
The above code also separates the value construction from cache management. If desired, we could use this to eliminate the recursion by priming the cache in the correct pattern:
for i in (0..s1.len()).rev() {
for j in (0..s2.len()).rev() {
cache.insert((i, j), ...);
}
}
A disadvantage of using indices instead of references is that we lose type safety: all those usize
values look the same. This can be often be adressed with crates like typed-arena
.