Metrics for evaluating alignment between timestamped text and audio #383

avishkarsaha · 2024-07-30T16:11:55Z

avishkarsaha
Jul 30, 2024

Say I have some text with timestamps at the word/segment level, and the corresponding audio of the speech. Are there any methods within this repository which might help with evaluating how well aligned the word/segment level timestamps are with the audio?

jianfch · 2024-07-31T00:24:33Z

jianfch
Jul 31, 2024
Maintainer

It would be tricky to evaluate the timestamps if the transcription is not perfect. But assuming you used align() and accounted for the text normalization (i.e. the ground truth and the result both have the same words), you can treat it as a 1D segmentation problem and evaluate timestamps with F1. You can try something like this:

import numpy as np
import stable_whisper  # assumption: this repository's package; not imported in the original snippet

def compute_f1(pred_starts, pred_ends, gt_starts, gt_ends):
    # tp: duration where the predicted and ground-truth intervals overlap
    tp = np.clip(np.minimum(pred_ends, gt_ends) - np.maximum(pred_starts, gt_starts), 0, None)
    # fp: predicted duration that extends before/after the ground-truth interval
    fp = np.clip(gt_starts - pred_starts, 0, None) + np.clip(pred_ends - gt_ends, 0, None)
    # fn: ground-truth duration that the prediction fails to cover
    fn = np.clip(pred_starts - gt_starts, 0, None) + np.clip(gt_ends - pred_ends, 0, None)
    # note: this is nan when both intervals have zero duration (0/0)
    return (2*tp)/(2*tp + fp + fn)

word_level = True

# `model` was not defined in the original snippet; loading it like this is an assumption
model = stable_whisper.load_model('base')

# we'll make synthetic data for this example
# let's treat this result as the ground truth
result = model.transcribe('audio.wav')
parts = result.all_words() if word_level else result.segments
gs = np.array([part.start for part in parts])
ge = np.array([part.end for part in parts])

# let's offset the result by 0.05 seconds, then use it as the actual prediction
# (the ground-truth arrays above were already copied out, so they are unaffected)
result.offset_time(0.05)
ps = np.array([part.start for part in parts])
pe = np.array([part.end for part in parts])

# label each score with its corresponding text
f1s = compute_f1(ps, pe, gs, ge)
word_timestamp_f1 = [dict(text=(part.word if word_level else part.text), f1=f1) for part, f1 in zip(parts, f1s)]
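To see what this F1 looks like on concrete numbers without loading a model, here is a minimal standalone sketch. The function name `interval_f1` and the timestamps are made up for illustration. It relies on the fact that, for a pair of intervals, 2*tp/(2*tp + fp + fn) is the same as 2*overlap/(predicted duration + ground-truth duration), and it returns 0 instead of nan when both intervals have zero duration (that choice is an assumption, not something from the snippet above).

```python
import numpy as np

def interval_f1(pred_starts, pred_ends, gt_starts, gt_ends):
    """Per-interval F1 (Dice overlap) between paired predicted and ground-truth intervals."""
    ps, pe = np.asarray(pred_starts, dtype=float), np.asarray(pred_ends, dtype=float)
    gs, ge = np.asarray(gt_starts, dtype=float), np.asarray(gt_ends, dtype=float)
    # overlap duration of each pair, clipped at 0 for disjoint intervals
    overlap = np.clip(np.minimum(pe, ge) - np.maximum(ps, gs), 0, None)
    # predicted duration + ground-truth duration
    total = (pe - ps) + (ge - gs)
    # 2*overlap/total; pairs with zero total duration get 0 instead of nan
    return np.divide(2 * overlap, total, out=np.zeros_like(total), where=total > 0)

# hypothetical timestamps (seconds) for three words
gt_starts, gt_ends = [0.0, 0.5, 1.0], [0.5, 1.0, 2.0]
pred_starts, pred_ends = [0.0, 0.6, 3.0], [0.5, 1.1, 4.0]

f1s = interval_f1(pred_starts, pred_ends, gt_starts, gt_ends)
mean_f1 = float(f1s.mean())  # one possible summary over all words
```

The first word matches exactly, the second is shifted by 0.1 s, and the third does not overlap its ground truth at all, so the scores run from perfect to zero. Averaging the per-word scores is only one way to summarize them.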

0 replies

avishkarsaha · 2024-07-31T06:00:38Z

avishkarsaha
Jul 31, 2024
Author

Awesome thanks!

And in general, there is no score to measure the accuracy of the alignment when there are no ground truth timestamps, right? Just wondering how I could evaluate an alignment between text and audio when there are no ground truth timestamps (I tried using a DTW cost between the phonemized audio sequence and the phonemized text sequence, but it's not 100% reliable).
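For readers unfamiliar with the DTW cost mentioned here, a minimal sketch of one possible form follows. It assumes both sides have already been reduced to sequences of phoneme symbols, uses a simple 0/1 mismatch cost, and normalizes by the combined length; the function name `dtw_cost` and the phoneme sequences are hypothetical, and this is not necessarily what was actually tried.

```python
import numpy as np

def dtw_cost(seq_a, seq_b):
    """Length-normalized DTW cost between two non-empty symbol sequences."""
    n, m = len(seq_a), len(seq_b)
    # acc[i, j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            mismatch = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            # extend the cheapest of the three allowed predecessor cells
            acc[i, j] = mismatch + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m] / (n + m)

# hypothetical phoneme sequences: one recognized from audio, one derived from the text
audio_phonemes = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
text_phonemes = ["HH", "EH", "L", "OW", "W", "ER", "L", "D"]

cost = dtw_cost(audio_phonemes, text_phonemes)
```

A cost of 0 means the two sequences can be warped onto each other without any mismatch; larger values mean more mismatched symbols along the best path.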

1 reply

jianfch Aug 1, 2024
Maintainer

"And in general there is no score to measure the accuracy of the alignment when there are no ground truth timestamps right"

No, not that I know of. Reliably measuring accuracy implies that you have accurate data to measure it against.

Answer selected by avishkarsaha
