Scripting Tcl 大小事: Edit Distance

You are given two strings $S1 and $S2. Write a script to find out the minimum operations required to convert $S1 into $S2. The operations can be 'insert', 'remove' or 'replace' a character.

可以參考維基百科 Edit distance 的解釋。這裡使用最簡單（但是執行效率不佳）的遞迴版本實作。

#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to 
# convert $S1 into $S2. The operations can be 'insert', 'remove' 
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
    if {$m == 0} {
        return $n
    }

    if {$n == 0} {
        return $m
    }

    set mvalue [tcl::mathop::- $m 1]
    set nvalue [tcl::mathop::- $n 1]

    # If last characters of two strings are same, nothing
    # much to do. Ignore and count for remaining strings.
    if {[string index $str1 $mvalue] == [string index $str2 $nvalue]} {
        return [editdistance $str1 $str2 $mvalue $nvalue]
    }

    # To insert, remove or replace operation
    return [tcl::mathop::+ 1 [tcl::mathfunc::min \
                   [editdistance $str1 $str2 $m $nvalue] \
                   [editdistance $str1 $str2 $mvalue $n] \
                   [editdistance $str1 $str2 $mvalue $nvalue]]]
}

if {$argc != 2} {
    exit
}

set s1 [lindex $argv 0]
set s2 [lindex $argv 1]

set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"

如果觀察程式的運作，可以注意到，會一再重覆計算一些已經計算過的項目。因此一個改進的方向是 dynamic programming，運用陣列記住已經算過的值，這樣就可以避免重複計算，讓程式計算速度更快。所以建立一個 mxn 的陣列，並且記錄已經計算過的值，最後 dp(m,n) 就是答案。

#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to 
# convert $S1 into $S2. The operations can be 'insert', 'remove' 
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
    array set dp {}

    for {set i 0} {$i <= $m} {incr i} {
        for {set j 0} {$j <= $n} {incr j} {
            set ivalue [tcl::mathop::- $i 1]
            set jvalue [tcl::mathop::- $j 1]

            if {$i == 0} {
                set dp($i,$j) $j
            } elseif {$j == 0} {
                set dp($i,$j) $i
            } elseif {[string index $str1 $ivalue] == [string index $str2 $jvalue]} {
                set dp($i,$j) $dp($ivalue,$jvalue)
            } else {
                set dp($i,$j) [tcl::mathop::+ 1 [tcl::mathfunc::min \
                   $dp($i,$jvalue) $dp($ivalue,$j) $dp($ivalue,$jvalue)]]
            }
        }
    }

    return $dp($m,$n)
}

if {$argc != 2} {
    exit
}

set s1 [lindex $argv 0]
set s2 [lindex $argv 1]

set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"

但是上面的解法如果是比較大的字串，就會需要很多空間儲存。觀察程式的計算，可以發現當我們在計算第二列的時候，只需要第一列的結果（以此類推）。所以我們可以使用一個 2xm 的陣列來計算。

#!/usr/bin/env tclsh
#
# You are given two strings $S1 and $S2.
# Write a script to find out the minimum operations required to 
# convert $S1 into $S2. The operations can be 'insert', 'remove' 
# or 'replace' a character.
#
proc editdistance {str1 str2 m n} {
    array set dp {}

    # Base condition when second string is empty
    for {set i 0} {$i <= $m} {incr i} {
        set dp(0,$i) $i
    }

    for {set i 1} {$i <= $n} {incr i} {
        for {set j 0} {$j <= $m} {incr j} {
            set ivalue [tcl::mathop::- $i 1]
            set jvalue [tcl::mathop::- $j 1]

            if {$j == 0} {
                set dp([tcl::mathop::% $i 2],$j) $i
            } elseif {[string index $str1 $jvalue] == [string index $str2 $ivalue]} {
                set dp([tcl::mathop::% $i 2],$j) $dp([tcl::mathop::% $ivalue 2],$jvalue)
            } else {
                set dp([tcl::mathop::% $i 2],$j) \
                    [tcl::mathop::+ 1 [tcl::mathfunc::min \
                        $dp([tcl::mathop::% $i 2],$jvalue) \
                        $dp([tcl::mathop::% $ivalue 2],$j) \
                        $dp([tcl::mathop::% $ivalue 2],$jvalue)]]
            }
        }
    }

    return $dp([tcl::mathop::% $n 2],$m)
}

if {$argc != 2} {
    exit
}

set s1 [lindex $argv 0]
set s2 [lindex $argv 1]

set r [editdistance $s1 $s2 [string length $s1] [string length $s2]]
puts "Output: $r"

Scripting Tcl 大小事

2021-01-24

Edit Distance

沒有留言: