-
Notifications
You must be signed in to change notification settings - Fork 736
levenshtein distance #927
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
levenshtein distance #927
Conversation
|
|
||
|
|
||
| class TestLevenshteinDistance(common_utils.TorchaudioTestCase): | ||
| @parameterized.expand( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add edge cases such as ["abc", "", 3] or ["", "", 0] as well
torchaudio/metrics.py
Outdated
|
|
||
|
|
||
| def levenshtein_distance(r: Union[str, List[str]], h: Union[str, List[str]]): | ||
| def levenshtein_distance(r: Union[str, List[str]], h: Union[str, List[str]]) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a minimalism perspective, I can see that List[str] arguments can be useful, but aside from syntactic sugar, what else do they offer users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can you mix List[str] and str?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function runs with any two arbitrary sequences. I updated this. That being said, a user would only really using this for two Sequence[T] for the same T.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
relate to comment.
| ["aa", "aaa", 1], | ||
| ["aaa", "aa", 1], | ||
| ["abc", "bcd", 2], | ||
| [["hello", "world"], ["hello", "world", "!"], 1], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would [["hello", "world"], "world"] do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would said that ["hello", "world"] is not the same as "w", and then "o" "r" "d" will need to be added, so the edit distance would be 1 replacement + 4 additions, so 5 edits. This is a particular case of comment.
Added as a test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could decide to detect such cases and compute the edit distance between the word, and each of the word, and return a list of edit distance instead.
This then lead to the question: what API do we expect a metric to offer, and how would this one align?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we go that path, then we'd treat as special cases:
[Sequence[T]], Sequence[T]]][Sequence[T], List[Sequence[T]]][List[Sequence[T]], Sequence[T]][List[Sequence[T]], List[Sequence[T]]]
The last is ambiguous: batch compare 1-1, compare all possible pairs, or edit distance between the two lists (e.g. comparing sentences)?
The two current use cases are
- two strings
- two sentences (list of strings)
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd stick to "two strings" and add the other behavior later on, if necessary. I know you already use it in your example, but how much worse does the code become?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I follow your suggestiong: are you saying the function only handles a reference and hypothesis string: levenshtein_distance("Hello", "ello")?
How do you suggest comparing the edit distance between the sentences "Hello World!" and "Bonjour World!"? Right now, the sentence is first split into by spaces, and the distance function is then applied on the two lists of words: levenshtein_distance(["Hello", "World!"], ["Bonjour", "World!"]).
torchaudio/metrics.py
Outdated
| Calculate the Levenshtein distance between two lists or strings. | ||
| The function computes an edit distance allowing deletion, insertion and substitution. | ||
| The result is an integer. Users may want to normalize by the length of the reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should imo also include an explanation of what happens when the user passes a list of strings and how that is different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, and updated. Doc may need to be updated further following discussion from comment.
b6ea6b7 to
ec889c1
Compare
torchaudio/metrics.py
Outdated
| @@ -0,0 +1,40 @@ | |||
| from collections.abc import Sequence | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this really deserve to be in a new namespace, for user-side API perspective?
I could also see that this can live inside of torchaudio.functionals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Splitting the implementation to the dedicated module makes sense. functionals.py grew too large in my opinion, but we can also import functions defined in this module in functionals.py so that users can access it via torchaudio.functionals
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a very good question. We'll need to follow-up on where this should go. functional.py does seem like a good candidate.
fix position of imports.
|
Closed by #1601 |
This PR takes the levenshtein distance from #632 and moves it to torchaudio. This also adds the docstring from vincentqb#3.
cc notebook