Metrics for evaluating alignment between timestamped text and audio #383
-
Say I have some text with timestamps at the word/segment level, and the corresponding audio of the speech. Are there any methods within this repository that might help with evaluating how well the word/segment-level timestamps align with the audio?
Replies: 2 comments 1 reply
-
It would be tricky to evaluate the timestamps if the transcription is not perfect. But assuming you used ...

    import numpy as np

    def compute_f1(pred_starts, pred_ends, gt_starts, gt_ends):
        # tp: duration where the predicted and ground-truth intervals overlap
        tp = np.clip(np.minimum(pred_ends, gt_ends) - np.maximum(pred_starts, gt_starts), 0, None)
        # fp: predicted duration that falls outside the ground-truth interval
        fp = np.clip(gt_starts - pred_starts, 0, None) + np.clip(pred_ends - gt_ends, 0, None)
        # fn: ground-truth duration that the prediction does not cover
        fn = np.clip(pred_starts - gt_starts, 0, None) + np.clip(gt_ends - pred_ends, 0, None)
        return (2*tp)/(2*tp + fp + fn)

    word_level = True
    # we'll make synthetic data for this example
    # `model` is assumed to be an already-loaded model
    # let's treat this result as the ground truth
    result = model.transcribe('audio.wav')
    parts = result.all_words() if word_level else result.segments
    gtimestamps = [(part.start, part.end) for part in parts]
    gs, ge = np.array([ts[0] for ts in gtimestamps]), np.array([ts[1] for ts in gtimestamps])
    # let's offset the result by 0.05 seconds, then use it as the actual prediction
    result.offset_time(0.05)
    ptimestamps = [(part.start, part.end) for part in parts]
    ps, pe = np.array([ts[0] for ts in ptimestamps]), np.array([ts[1] for ts in ptimestamps])
    # label each score with its corresponding text
    f1s = compute_f1(ps, pe, gs, ge)
    word_timestamp_f1 = [dict(text=(part.word if word_level else part.text), f1=f1) for part, f1 in zip(parts, f1s)]
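To get a feel for the scores `compute_f1` produces, here is a small worked example for a single interval that needs no model or audio. It is only a sketch: `interval_f1` is a hypothetical helper (not part of the repository), written in plain Python, and it uses the fact that for overlapping intervals `2*tp + fp + fn` equals the predicted duration plus the ground-truth duration.

```python
def interval_f1(pred_start, pred_end, gt_start, gt_end):
    """F1 of the temporal overlap between one predicted and one ground-truth interval.

    Hypothetical helper for illustration; returns 0.0 when the intervals
    do not overlap or when both have zero duration.
    """
    # tp: seconds shared by both intervals
    overlap = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    # For overlapping intervals, 2*tp + fp + fn is the sum of both durations
    denom = (pred_end - pred_start) + (gt_end - gt_start)
    return 2 * overlap / denom if denom > 0 else 0.0


# A 0.5 s word shifted by 0.05 s loses 10% of its overlap
print(round(interval_f1(1.05, 1.55, 1.0, 1.5), 3))  # 0.9
# The same 0.05 s shift on a 2.0 s segment costs much less
print(round(interval_f1(3.05, 5.05, 3.0, 5.0), 3))  # 0.975
```

For a pure shift smaller than the interval, the score works out to `1 - offset / duration`, so the same timing error is penalized more heavily on short words than on long segments. That is worth keeping in mind when comparing word-level and segment-level scores.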
-
Awesome, thanks! And in general, there is no score to measure the accuracy of the alignment when there are no ground truth timestamps, right? Just wondering how I could evaluate an alignment between text and audio when there are no ground truth timestamps (I tried using a DTW cost between the phonemized audio sequence and the phonemized text sequence, but it's not 100% reliable).
No, not that I know of. Reliably measuring accuracy implies that you have accurate data to measure it against.
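For readers unfamiliar with the DTW cost mentioned in the question, a minimal sketch of the idea follows. Everything here is an assumption made for illustration rather than the setup actually used in the thread: the function name `dtw_cost`, the 0/1 mismatch cost between phoneme symbols, and the normalization by the combined sequence length.

```python
def dtw_cost(seq_a, seq_b):
    """Normalized dynamic time warping cost between two symbol sequences.

    Illustrative sketch: the local cost is 0 for identical symbols and 1
    otherwise, and the total is divided by len(seq_a) + len(seq_b).
    """
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0.0 if seq_a[i - 1] == seq_b[j - 1] else 1.0
            # extend the cheapest of the three allowed predecessor cells
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


# Phonemes recognized from audio vs. phonemes derived from the text
print(dtw_cost(["k", "ae", "t"], ["k", "ae", "t"]))        # 0.0
print(dtw_cost(["k", "ae", "t"], ["k", "ae", "ae", "t"]))  # 0.0
```

The second call shows one reason such a cost is only a rough proxy: DTW absorbs repeated symbols for free, so a low cost says the two sequences are compatible but does not confirm that any individual timestamp is correct.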