You are given two strings $S1 and $S2.
Write a script to find out the minimum operations required to convert $S1 into $S2.
The operations can be 'insert', 'remove' or 'replace' a character.
可以參考維基百科 Edit distance 的解釋。
這裡使用最簡單(但是執行效率不佳)的遞迴版本實作。
#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to
# convert $S1 into $S2. The operations can be 'insert', 'remove'
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
if {$m == 0} {
return $n
}
if {$n == 0} {
return $m
}
set mvalue [tcl::mathop::- $m 1]
set nvalue [tcl::mathop::- $n 1]
# If last characters of two strings are same, nothing
# much to do. Ignore and count for remaining strings.
if {[string index $str1 $mvalue] == [string index $str2 $nvalue]} {
return [editdistance $str1 $str2 $mvalue $nvalue]
}
# To insert, remove or replace operation
return [tcl::mathop::+ 1 [tcl::mathfunc::min \
[editdistance $str1 $str2 $m $nvalue] \
[editdistance $str1 $str2 $mvalue $n] \
[editdistance $str1 $str2 $mvalue $nvalue]]]
}
if {$argc != 2} {
exit
}
set s1 [lindex $argv 0]
set s2 [lindex $argv 1]
set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"
如果觀察程式的運作,可以注意到,會一再重覆計算一些已經計算過的項目。 因此一個改進的方向是 dynamic programming,運用陣列記住已經算過的值,這樣就可以避免重複計算, 讓程式計算速度更快。所以建立一個 mxn 的陣列,並且記錄已經計算過的值,最後 dp(m,n) 就是答案。
#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to
# convert $S1 into $S2. The operations can be 'insert', 'remove'
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
array set dp {}
for {set i 0} {$i <= $m} {incr i} {
for {set j 0} {$j <= $n} {incr j} {
set ivalue [tcl::mathop::- $i 1]
set jvalue [tcl::mathop::- $j 1]
if {$i == 0} {
set dp($i,$j) $j
} elseif {$j == 0} {
set dp($i,$j) $i
} elseif {[string index $str1 $ivalue] == [string index $str2 $jvalue]} {
set dp($i,$j) $dp($ivalue,$jvalue)
} else {
set dp($i,$j) [tcl::mathop::+ 1 [tcl::mathfunc::min \
$dp($i,$jvalue) $dp($ivalue,$j) $dp($ivalue,$jvalue)]]
}
}
}
return $dp($m,$n)
}
if {$argc != 2} {
exit
}
set s1 [lindex $argv 0]
set s2 [lindex $argv 1]
set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"
但是上面的解法如果是比較大的字串,就會需要很多空間儲存。觀察程式的計算,
可以發現當我們在計算第二列的時候,只需要第一列的結果(以此類推)。
所以我們可以使用一個 2xm 的陣列來計算。
#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to
# convert $S1 into $S2. The operations can be 'insert', 'remove'
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
array set dp {}
# Base condition when second string is empty
for {set i 0} {$i <= $m} {incr i} {
set dp(0,$i) $i
}
for {set i 1} {$i <= $n} {incr i} {
for {set j 0} {$j <= $m} {incr j} {
set ivalue [tcl::mathop::- $i 1]
set jvalue [tcl::mathop::- $j 1]
if {$j == 0} {
set dp([tcl::mathop::% $i 2],$j) $i
} elseif {[string index $str1 $jvalue] == [string index $str2 $ivalue]} {
set dp([tcl::mathop::% $i 2],$j) $dp([tcl::mathop::% $ivalue 2],$jvalue)
} else {
set dp([tcl::mathop::% $i 2],$j) \
[tcl::mathop::+ 1 [tcl::mathfunc::min \
$dp([tcl::mathop::% $i 2],$jvalue) \
$dp([tcl::mathop::% $ivalue 2],$j) \
$dp([tcl::mathop::% $ivalue 2],$jvalue)]]
}
}
}
return $dp([tcl::mathop::% $n 2],$m)
}
if {$argc != 2} {
exit
}
set s1 [lindex $argv 0]
set s2 [lindex $argv 1]
set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"