levenshtein distance #927

vincentqb · 2020-09-29T23:19:46Z

This PR takes the levenshtein distance from #632 and moves it to torchaudio. This also adds the docstring from vincentqb#3.

cc notebook

cpuhrsch · 2020-09-30T13:35:19Z

test/torchaudio_unittest/metrics.py

+
+
+class TestLevenshteinDistance(common_utils.TorchaudioTestCase):
+    @parameterized.expand(


I'd add edge cases such as ["abc", "", 3] or ["", "", 0] as well
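The suggested edge cases could be exercised roughly as below. This is only a sketch: the `levenshtein_distance` here is a stand-in written for illustration (not the PR's implementation), and the standard library's `subTest` replaces `parameterized.expand` so the snippet has no third-party dependency.

```python
import unittest


def levenshtein_distance(r, h):
    # Illustrative stand-in: two-row dynamic programme over any two sequences.
    previous = list(range(len(h) + 1))
    for i, r_item in enumerate(r, start=1):
        current = [i]
        for j, h_item in enumerate(h, start=1):
            current.append(min(
                previous[j - 1] + (r_item != h_item),  # substitution (free on a match)
                current[j - 1] + 1,                    # insertion
                previous[j] + 1,                       # deletion
            ))
        previous = current
    return previous[-1]


class TestLevenshteinEdgeCases(unittest.TestCase):
    def test_empty_inputs(self):
        cases = [
            ("abc", "", 3),  # delete every reference character
            ("", "abc", 3),  # insert every hypothesis character
            ("", "", 0),     # nothing to edit
        ]
        for ref, hyp, expected in cases:
            with self.subTest(ref=ref, hyp=hyp):
                self.assertEqual(levenshtein_distance(ref, hyp), expected)
```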

cpuhrsch · 2020-09-30T13:36:44Z

torchaudio/metrics.py



-def levenshtein_distance(r: Union[str, List[str]], h: Union[str, List[str]]):
+def levenshtein_distance(r: Union[str, List[str]], h: Union[str, List[str]]) -> int:


From a minimalism perspective, I can see that List[str] arguments can be useful, but aside from syntactic sugar, what else do they offer users?

Also, can you mix List[str] and str?

The function runs with any two arbitrary sequences. I updated this. That being said, a user would only really use this for two Sequence[T] with the same T.
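A minimal sketch of what "any two sequences of the same T" could look like with type hints. The signature and body below are assumptions made for illustration, not the code in this PR; the only requirement on the elements is that they support `!=`.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def levenshtein_distance(r: Sequence[T], h: Sequence[T]) -> int:
    """Edit distance between reference r and hypothesis h (illustrative sketch)."""
    # previous[j] is the distance between the r-prefix seen so far and h[:j].
    previous = list(range(len(h) + 1))
    for i, r_item in enumerate(r, start=1):
        current = [i]  # distance from r[:i] to the empty prefix of h
        for j, h_item in enumerate(h, start=1):
            substitution = previous[j - 1] + (r_item != h_item)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Characters of a string, words in a list, or integers in a tuple all work,
# because only element equality is used.
chars = levenshtein_distance("aa", "aaa")
words = levenshtein_distance(["hello", "world"], ["hello", "world", "!"])
ints = levenshtein_distance((1, 2, 3), (1, 3))
```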

relate to comment.

cpuhrsch · 2020-09-30T13:37:11Z

test/torchaudio_unittest/metrics.py

+            ["aa", "aaa", 1],
+            ["aaa", "aa", 1],
+            ["abc", "bcd", 2],
+            [["hello", "world"], ["hello", "world", "!"], 1],


What would [["hello", "world"], "world"] do?

It would say that neither "hello" nor "world" matches any single character of "world" ("w", "o", "r", "l", "d"), so both list elements would be replaced and three characters added: 2 replacements + 3 additions, so 5 edits. This is a particular case of comment.

Added as a test.
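The mixed case can be checked by hand without running the distance function at all. The sketch below only relies on two facts: iterating a `str` yields single characters, and when no pair of elements is equal, the edit distance is the length of the longer sequence.

```python
r = ["hello", "world"]  # two elements, each a whole word
h = "world"             # five elements, each a single character

# Iterating a str yields its characters, so these are the items compared against r.
h_items = list(h)

# No word in r equals any single character of h.
has_match = any(r_item == h_item for r_item in r for h_item in h_items)

# With no equal pair, no edit can be saved, so the distance is simply
# the length of the longer sequence.
distance_without_matches = max(len(r), len(h_items))
```

Here `has_match` is `False` and `distance_without_matches` is 5, which agrees with the total given above.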

We could decide to detect such cases, compute the edit distance between the word and each word in the list, and return a list of edit distances instead.

This then leads to the question: what API do we expect a metric to offer, and how would this one align?

If we go that path, then we'd treat as special cases:

[Sequence[T], Sequence[T]]

[Sequence[T], List[Sequence[T]]]

[List[Sequence[T]], Sequence[T]]

[List[Sequence[T]], List[Sequence[T]]]

The last is ambiguous: batch compare 1-1, compare all possible pairs, or edit distance between the two lists (e.g. comparing sentences)?
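To make the ambiguity concrete, here is a sketch of the three readings for two lists of strings. The helper names (`one_to_one`, `all_pairs`) and the stand-in `levenshtein_distance` are hypothetical and exist only for this illustration; none of them is proposed API.

```python
from typing import List, Sequence


def levenshtein_distance(r: Sequence, h: Sequence) -> int:
    # Illustrative stand-in for the function under review.
    previous = list(range(len(h) + 1))
    for i, r_item in enumerate(r, start=1):
        current = [i]
        for j, h_item in enumerate(h, start=1):
            current.append(min(
                previous[j - 1] + (r_item != h_item),
                current[j - 1] + 1,
                previous[j] + 1,
            ))
        previous = current
    return previous[-1]


def one_to_one(refs: List[str], hyps: List[str]) -> List[int]:
    # Reading 1: batch compare, refs[k] against hyps[k].
    return [levenshtein_distance(r, h) for r, h in zip(refs, hyps)]


def all_pairs(refs: List[str], hyps: List[str]) -> List[List[int]]:
    # Reading 2: every reference against every hypothesis.
    return [[levenshtein_distance(r, h) for h in hyps] for r in refs]


refs = ["abc", "abd"]
hyps = ["abc", "xbd"]

batch = one_to_one(refs, hyps)               # [0, 1]
matrix = all_pairs(refs, hyps)               # [[0, 2], [1, 1]]
# Reading 3: treat each list as one sequence of words (sentence comparison).
sentence = levenshtein_distance(refs, hyps)  # 1
```

The same two inputs give a list, a matrix, or a single integer depending on the reading, which is why the return type cannot be inferred from the argument types alone.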

The two current use cases are

two strings

two sentences (list of strings)

Thoughts?

I'd stick to "two strings" and add the other behavior later on, if necessary. I know you already use it in your example, but how much worse does the code become?

I'm not sure I follow your suggestion: are you saying the function should only handle a reference and a hypothesis string: levenshtein_distance("Hello", "ello")?

How do you suggest comparing the edit distance between the sentences "Hello World!" and "Bonjour World!"? Right now, each sentence is first split by spaces, and the distance function is then applied to the two lists of words: levenshtein_distance(["Hello", "World!"], ["Bonjour", "World!"]).

cpuhrsch · 2020-09-30T13:38:43Z

torchaudio/metrics.py

    Calculate the Levenshtein distance between two lists or strings.
+
+    The function computes an edit distance allowing deletion, insertion and substitution.
+    The result is an integer. Users may want to normalize by the length of the reference.


This should imo also include an explanation of what happens when the user passes a list of strings and how that is different.

Agree, and updated. Doc may need to be updated further following discussion from comment.
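For the docstring, the difference between string and list inputs, and the normalization mentioned in the diff, might be illustrated along these lines. `normalized_distance` is a hypothetical helper introduced only for this sketch, and raising on an empty reference is one possible choice, not something decided in this PR.

```python
from typing import Sequence


def levenshtein_distance(r: Sequence, h: Sequence) -> int:
    # Illustrative stand-in for the function under review.
    previous = list(range(len(h) + 1))
    for i, r_item in enumerate(r, start=1):
        current = [i]
        for j, h_item in enumerate(h, start=1):
            current.append(min(
                previous[j - 1] + (r_item != h_item),
                current[j - 1] + 1,
                previous[j] + 1,
            ))
        previous = current
    return previous[-1]


def normalized_distance(r: Sequence, h: Sequence) -> float:
    # Hypothetical helper: edit distance divided by the reference length.
    if len(r) == 0:
        raise ValueError("cannot normalize by an empty reference")
    return levenshtein_distance(r, h) / len(r)


reference = "Hello World!"
hypothesis = "Bonjour World!"

# Passing lists of words counts word-level edits: one substitution here.
word_edits = levenshtein_distance(reference.split(" "), hypothesis.split(" "))

# Normalizing by the two reference words gives a word-level error rate.
word_rate = normalized_distance(reference.split(" "), hypothesis.split(" "))

# Passing the raw strings counts character-level edits instead,
# which is generally a different (here larger) number.
char_edits = levenshtein_distance(reference, hypothesis)
```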

mthrok · 2020-10-01T15:32:56Z

torchaudio/metrics.py

@@ -0,0 +1,40 @@
+from collections.abc import Sequence


Does this really deserve to be in a new namespace, from a user-side API perspective?
I could also see this living inside torchaudio.functional.

Splitting the implementation into a dedicated module makes sense. functional.py has grown too large in my opinion, but we can also import the functions defined in this module in functional.py so that users can access them via torchaudio.functional.

That's a very good question. We'll need to follow up on where this should go. functional.py does seem like a good candidate.

fix position of imports.

vincentqb · 2021-06-24T22:08:03Z

Closed by #1601

vincentqb requested a review from cpuhrsch September 29, 2020 23:20

cpuhrsch changed the title ~~add levenshtein distance to torchaudio from pipeline example~~ levenshtein distance Sep 30, 2020

cpuhrsch reviewed Sep 30, 2020

vincentqb force-pushed the levenshtein branch 3 times, most recently from b6ea6b7 to ec889c1 Compare September 30, 2020 16:31

mthrok reviewed Oct 1, 2020

vincentqb added 2 commits October 7, 2020 16:28

add levenshtein distance to torchaudio from pipeline example.

0cc5d8c

fix position of imports.

adding edge cases. more general type.

1e97f92

vincentqb force-pushed the levenshtein branch from 02010f8 to 620b516 Compare October 7, 2020 20:28

move out of metrics.

f401713

vincentqb force-pushed the levenshtein branch from 620b516 to f401713 Compare October 7, 2020 20:29

vincentqb added 3 commits October 7, 2020 16:33

torchscript test.

5186fd0

torchscript generator.

4f509f1

torchscript types.

190f288

vincentqb force-pushed the levenshtein branch from b884c2b to 190f288 Compare October 8, 2020 15:58

facebook-github-bot added the CLA Signed label Oct 30, 2020

yangarbiter mentioned this pull request Jun 21, 2021

Add edit distance from example to torchaudio.functional #1601

Merged

vincentqb closed this Jun 24, 2021



