TF get_sprint_automata_for_batch: RASR segmentation fault in Speech::CTCTopologyGraphBuilder::addLoopTransition
#1456
Comments
Ah, that's just in … But it means there was another actual exception happening before. Can you post the full log?
Sure, the full log is here:
See also in … I created a script to reproduce the error: …
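As noted above, the TensorFlow "Graph execution error" is only the wrapper: get_sprint_automata_for_batch is a tf.compat.v1.py_func op, so any Python exception raised inside it (here because the RASR subprocess died) resurfaces as an UNKNOWN error at session.run time, and the real cause has to be looked up in the RASR log (nn-trainer.loss.log). A minimal, self-contained sketch of that mechanism (illustrative only, not code from this setup):

```python
# Illustrative sketch only (not from this issue): an exception raised inside a
# tf.compat.v1.py_func body shows up as tf.errors.UnknownError when the graph runs,
# just like the "SprintSubprocessInstance Sprint init failed" in the trace below.
import tensorflow as tf

tf.compat.v1.disable_eager_execution()


def fake_get_automata(tags):
    # Stand-in for py_wrap_get_sprint_automata_for_batch: the RASR child process
    # crashed, so we cannot read its reply and raise instead.
    raise Exception("SprintSubprocessInstance Sprint init failed")


seq_tags = tf.compat.v1.placeholder(tf.string, shape=(None,), name="seq_tag")
automata = tf.compat.v1.py_func(fake_get_automata, [seq_tags], tf.string, stateful=True)

with tf.compat.v1.Session() as session:
    try:
        session.run(automata, feed_dict={seq_tags: [b"some/segment/tag"]})
    except tf.errors.UnknownError as exc:
        # The original Python traceback is embedded in the error message.
        print("UnknownError from py_func:", exc.message.splitlines()[0])
```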
We encountered this bug and there is a patch for it. Daniel wanted to do a PR.
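For context, on the RETURNN side the failing loss is wired up roughly like the following. This is a hypothetical sketch for illustration only; paths and the RASR config string are placeholders, not taken from this setup. The flags RASR is actually started with can be read off the SprintSubprocessInstance command line further down in the log.

```python
# Hypothetical RETURNN config fragment (an assumption, not copied from this setup):
# the "fast_bw" loss spawns the RASR nn-trainer as a subprocess via sprint_opts.
network = {
    # ... feature frontend and Conformer block as listed in the log below ...
    "output": {
        "class": "softmax",
        "from": "encoder",
        "loss": "fast_bw",
        "loss_opts": {
            "sprint_opts": {
                # placeholder paths/flags, adjust to your own RASR checkout and loss config
                "sprintExecPath": "/path/to/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard",
                "sprintConfigStr": "--*.allophone-state-graph-builder.topology=ctc ...",
                "minPythonControlVersion": 4,
            },
        },
    },
}
```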
RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-17-46 (UTC+0100), pid 1212279, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3
RETURNN command line options: ['returnn.config']
Hostname: cn-04
TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)
Use num_threads=1 (but min 2) via OMP_NUM_THREADS.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.
CUDA_VISIBLE_DEVICES is not set.
Collecting TensorFlow device list...
Local devices available to TensorFlow:
1/1: name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 3855380559335333431
xla_global_id: -1
Train data:
input: 1 x 1
output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]}
OggZipDataset, sequences: 249229, frames: unknown
Dev data:
OggZipDataset, sequences: 300, frames: unknown
RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-18-11 (UTC+0100), pid 3325131, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3
RETURNN command line options: ['returnn.config']
Hostname: cn-285
TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) (<not-under-git> in /usr/local/lib/python3.8/dist-packages/tensorflow)
Use num_threads=1 (but min 2) via OMP_NUM_THREADS.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.
CUDA_VISIBLE_DEVICES is set to '2'.
Collecting TensorFlow device list...
Local devices available to TensorFlow:
1/2: name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 7046766875533982763
xla_global_id: -1
2/2: name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 10089005056
locality {
bus_id: 1
links {
}
}
incarnation: 14158601620701111509
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5"
xla_global_id: 416903419
Using gpu device 2: NVIDIA GeForce RTX 2080 Ti
Hostname 'cn-285', GPU 2, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB
Train data:
input: 1 x 1
output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]}
OggZipDataset, sequences: 249229, frames: unknown
Dev data:
OggZipDataset, sequences: 300, frames: unknown
Learning-rate-control: file learning_rates.swb.ctc does not exist yet
Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ...
layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32
layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32
layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32
DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input.
This will be disallowed with behavior_version 8.
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32
layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed
This will be disallowed with behavior_version 6.
layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32
layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32
layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32
layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32
layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32
layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32
layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32
layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32
layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32
layer /'conformer_1_conv_mod_depthwise_conv': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
DEPRECATION WARNING: batch_norm masked_time should be specified explicitly
This will be disallowed with behavior_version 12.
layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32
layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32
layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32
Network layer topology:
extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'}
used data keys: ['data', 'seq_tag']
layers:
layer batch_norm 'conformer_1_conv_mod_bn' #: 512
layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512
layer copy 'conformer_1_conv_mod_dropout' #: 512
layer gating 'conformer_1_conv_mod_glu' #: 512
layer layer_norm 'conformer_1_conv_mod_ln' #: 512
layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024
layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512
layer combine 'conformer_1_conv_mod_res_add' #: 512
layer activation 'conformer_1_conv_mod_swish' #: 512
layer copy 'conformer_1_ffmod_1_dropout' #: 512
layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512
layer eval 'conformer_1_ffmod_1_half_res_add' #: 512
layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048
layer layer_norm 'conformer_1_ffmod_1_ln' #: 512
layer copy 'conformer_1_ffmod_2_dropout' #: 512
layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512
layer eval 'conformer_1_ffmod_2_half_res_add' #: 512
layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048
layer layer_norm 'conformer_1_ffmod_2_ln' #: 512
layer linear 'conformer_1_mhsa_mod_att_linear' #: 512
layer copy 'conformer_1_mhsa_mod_dropout' #: 512
layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512
layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64
layer combine 'conformer_1_mhsa_mod_res_add' #: 512
layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512
layer layer_norm 'conformer_1_output' #: 512
layer conv 'conv_1' #: 32
layer pool 'conv_1_pool' #: 32
layer conv 'conv_2' #: 64
layer conv 'conv_3' #: 64
layer merge_dims 'conv_merged' #: 24000
layer split_dims 'conv_source' #: 1
layer source 'data' #: 1
layer copy 'encoder' #: 512
layer subnetwork 'features' #: 750
layer conv 'features/conv_h' #: 150
layer eval 'features/conv_h_act' #: 150
layer variable 'features/conv_h_filter' #: 150
layer split_dims 'features/conv_h_split' #: 1
layer conv 'features/conv_l' #: 5
layer layer_norm 'features/conv_l_act' #: 750
layer eval 'features/conv_l_act_no_norm' #: 750
layer merge_dims 'features/conv_l_merge' #: 750
layer copy 'features/output' #: 750
layer copy 'input_dropout' #: 512
layer linear 'input_linear' #: 512
layer softmax 'output' #: 88
layer eval 'specaug' #: 750
net params #: 18473980
net trainable params: [<tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_beta:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_gamma:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/W:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/W:0' shape=(512, 1024) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/b:0' shape=(1024,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_att_linear/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_relpos_encoding/encoding_matrix:0' shape=(65, 64) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_self_attention/QKV:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'conformer_1_output/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_output/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conv_1/W:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'conv_1/bias:0' shape=(32,) dtype=float32>, <tf.Variable 'conv_2/W:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'conv_2/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'conv_3/W:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'conv_3/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'features/conv_h_filter/conv_h_filter:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 'features/conv_l/W:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'features/conv_l_act/bias:0' shape=(750,) dtype=float32>, <tf.Variable 'features/conv_l_act/scale:0' shape=(750,) dtype=float32>, <tf.Variable 'input_linear/W:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'output/W:0' shape=(512, 88) dtype=float32>, <tf.Variable 'output/b:0' shape=(88,) dtype=float32>]
start training at epoch 1
using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128
learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 1.325e-05 ...
TF: log_dir: output/models/train-2023-11-08-11-18-11
Create optimizer <class 'returnn.tf.updater.NadamOptimizer'> with options {'epsilon': 1e-08, 'learning_rate': <tf.Variable 'learning_rate:0' shape=() dtype=float32>}.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [<tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 1024) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(1024,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_att_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 
'optimize/gradients/conformer_1_mhsa_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_relpos_encoding/Gather_grad/Reshape_accum_grad/var_accum_grad:0' shape=(65, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_self_attention/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_h/convolution/ExpandDims_1_grad/Reshape_accum_grad/var_accum_grad:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l/convolution_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/input_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 88) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/add_bias_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(88,) dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta1_power:0' shape=() dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta2_power:0' shape=() dtype=float32>].
SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']
SprintSubprocessInstance: starting, pid 3325822
SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']) caused an exception.
TensorFlow exception: Graph execution error:
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in <module>
main()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main
execute_main_task()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task
engine.init_train_from_config(config, train_data, dev_data, eval_data)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config
self.init_network_from_config(config)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config
self._init_network(net_desc=net_dict, epoch=self.epoch)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network
self.network, self.updater = self.create_network(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network
updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__
self.loss = network.get_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective
self.maybe_construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective
self._construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective
losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized
if loss_obj.get_loss_value_for_objective() is not None:
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
self._prepare()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare
self._loss_value = self.loss.get_value()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value
fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in <module>
main()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main
execute_main_task()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task
engine.init_train_from_config(config, train_data, dev_data, eval_data)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config
self.init_network_from_config(config)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config
self._init_network(net_desc=net_dict, epoch=self.epoch)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network
self.network, self.updater = self.create_network(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network
updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__
self.loss = network.get_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective
self.maybe_construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective
self._construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective
losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized
if loss_obj.get_loss_value_for_objective() is not None:
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
self._prepare()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare
self._loss_value = self.loss.get_value()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value
fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
return Unpickler(p).load()
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
ret = func(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
return func(*args, **kwargs)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
instance = self._get_instance(i)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
self._maybe_create_new_instance()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
self.init()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
self._start_child()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
[[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
(1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
return Unpickler(p).load()
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
ret = func(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
return func(*args, **kwargs)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
instance = self._get_instance(i)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
self._maybe_create_new_instance()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
self.init()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
self._start_child()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in <module>
main()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main
execute_main_task()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task
engine.init_train_from_config(config, train_data, dev_data, eval_data)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config
self.init_network_from_config(config)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config
self._init_network(net_desc=net_dict, epoch=self.epoch)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network
self.network, self.updater = self.create_network(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network
updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__
self.loss = network.get_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective
self.maybe_construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective
self._construct_objective()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective
losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized
if loss_obj.get_loss_value_for_objective() is not None:
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
self._prepare()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare
self._loss_value = self.loss.get_value()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value
fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
edges, weights, start_end_states = tf_compat.v1.py_func(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
return dispatch_target(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
return py_func_common(func, inp, Tout, stateful, name=name)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
return _internal_py_func(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
result = gen_script_ops.py_func(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
_, _, _op, _outputs = _op_def_library._apply_op_helper(
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
op = g._create_op_internal(op_type_name, inputs, dtypes=None,
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
ret = Operation.from_node_def(
Exception UnknownError() in step 0. (pid 3325131)
Failing op: <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>
We tried to fetch the op inputs ([<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>]) but got another exception:
target_op <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>,
ops
[<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>]
EXCEPTION
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
line: return fn(*args)
locals:
fn = <local> <function BaseSession._do_run.<locals>._run_fn at 0x7f2192d77d30>
args = <local> ({<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
[-0.09610788],
[-0.05115783],
...,
[ 0. ],
[ 0. ],
[ 0. ]],
[[-0.00226238],
[-0.01049833],
[-0.00...
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
line: return self._call_tf_sessionrun(options, feed_dict, fetch_list,
target_list, run_metadata)
locals:
self = <local> <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
self._call_tf_sessionrun = <local> <bound method BaseSession._call_tf_sessionrun of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
options = <local> None
feed_dict = <local> {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
[-0.09610788],
[-0.05115783],
...,
[ 0. ],
[ 0. ],
[ 0. ]],
[[-0.00226238],
[-0.01049833],
[-0.001...
fetch_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
target_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
run_metadata = <local> None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict,
fetch_list, target_list,
run_metadata)
locals:
tf_session = <global> <module 'tensorflow.python.client.pywrap_tf_session' from '/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'>
tf_session.TF_SessionRun_wrapper = <global> <built-in method TF_SessionRun_wrapper of PyCapsule object at 0x7f2538137300>
self = <local> <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
self._session = <local> <tensorflow.python.client._pywrap_tf_session.TF_Session object at 0x7f2423372a70>
options = <local> None
feed_dict = <local> {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
[-0.09610788],
[-0.05115783],
...,
[ 0. ],
[ 0. ],
[ 0. ]],
[[-0.00226238],
[-0.01049833],
[-0.001...
fetch_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
target_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
run_metadata = <local> None
UnknownError: 2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
return Unpickler(p).load()
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
ret = func(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
return func(*args, **kwargs)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
instance = self._get_instance(i)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
self._maybe_create_new_instance()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
self.init()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
self._start_child()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
[[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
(1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core
|
|
AFAIR, the problem only occurs when running in an apptainer environment. The buffer does not contain all the info, and RETURNN crashes because the RASR automata are truncated/incomplete. |
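For illustration, here is a minimal, self-contained sketch (not RETURNN code, just the general pipe/pickle pattern) of why a Sprint/RASR child process that dies, or truncates its output, surfaces on the RETURNN side as the `EOFError: Ran out of input` seen in the traceback above:

```python
# Minimal sketch (assumed setup, not the actual RETURNN/RASR code): RETURNN
# unpickles replies from a pipe to the Sprint/RASR subprocess. If the child
# dies before writing a complete reply, Unpickler.load() hits end-of-file.
import os
import pickle

read_fd, write_fd = os.pipe()
os.close(write_fd)  # simulate the child crashing before it sends anything
with os.fdopen(read_fd, "rb") as pipe_from_child:
    try:
        pickle.Unpickler(pipe_from_child).load()
    except EOFError as exc:
        print("EOFError:", exc)  # -> "EOFError: Ran out of input"
```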
So for reference, the actual error is this:
|
I just tested the proposed patch and it does not fix the issue for my example. |
@vieting I pushed something which should fix this. Can you try? |
(For reference, there was also an EOFError in #1363, but I think that was another problem.) |
Note: I did not actually test my recent change, as I don't have any setup ready to try this. Please try it out and report if it works. |
Just tested and I still get the error. Log:
|
@albertz check |
@christophmluscher @NeoLegends does this relate to the RASR compiled with TF 2.13? Do you recognize this error? |
Is it maybe a problem that RASR was compiled with my old tf 2.8 image? I still use the same RASR binary with the new image. Loading the automata does not require TF, so I thought that I could use the same RASR. |
@vieting I pushed another small change. Can you try again? |
Unfortunately, this still does not fix my example.
|
I get the same error when using a tf 2.14 image and RASR compiled using that image. |
Is that the original stdout + stderr, or just the log? It looks a bit like RASR maybe does not start correctly at all. You should then e.g. see this on stdout:

    print("RETURNN SprintControl[pid %i] Python module load" % os.getpid())

And then:

    print(
        (
            "RETURNN SprintControl[pid %i] init: "
            "name=%r, sprint_unit=%r, version_number=%r, callback=%r, ref=%r, config=%r, kwargs=%r"
        )
        % (os.getpid(), name, sprint_unit, version_number, callback, reference, config, kwargs)
    )

If you don't see that, then my recent fixes, and also Tina's patch, are not really related to your issue at all. You should check the RASR log then. There should be some error from RASR, probably Python-related, maybe something like it could not load the module, or some import is missing. |
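As a quick sketch for checking whether those markers appeared (assuming the stdout/stderr was captured to a file; `returnn.log` is a hypothetical name):

```python
# Scan a captured stdout/stderr file for the SprintControl startup markers
# mentioned above. "returnn.log" is a placeholder; point it at your own log.
marker = "RETURNN SprintControl[pid"

with open("returnn.log", errors="replace") as f:
    hits = [line.rstrip() for line in f if marker in line]

if hits:
    print("SprintControl started; first markers:")
    print("\n".join(hits[:5]))
else:
    print("No SprintControl markers found; RASR's Python bridge probably never loaded.")
```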
What I posted before was from the log. The following is copied from stdout and stderr (with tf 2.14 image, also for RASR compilation):
|
The RASR log of the nn trainer does not contain anything that looks particularly suspicious to me. |
What about this?
|
And in your stdout, you see the actual error:
|
I just use |
Note that the segmentation fault only occurs with the tf 2.14 image and the RASR compiled with it. There might be something wrong on that side as well, see . With my previous settings (tf 2.13 image, RASR compiled with tf 2.8), this is the stdout + stderr
|
There it seems that RASR does not start at all. I see:
|
Btw, the RASR segmentation fault looks actually like a bug in RASR. RASR should never segfault. |
Most RASR problems result in a segmentation fault. Sometimes you get more info, sometimes it is just an inconsistent compilation.
|
Whenever RASR gives a segfault, that's a bug in RASR. It should never segfault. Can you link corresponding RASR issues here? Or if this is not reported yet, can you open a corresponding RASR issue? |
I created a RASR issue about the segfault in RASR with the tf2.14 image and RASR: rwth-i6/rasr#68 |
@vieting Did you look at that? Did you fix it? Maybe it just needs the right |
I just tried with the tf2.13 image and a RASR that was compiled without TF. There, I also get a segmentation fault. It looks identical to the one in rwth-i6/rasr#68.
|
So, on RETURNN/Python side, the last call before the crash is basically:
Everything happens inside RASR then ( |
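As a very rough sketch of that boundary (assumed names and a dummy child process, not the actual RETURNN `SprintSubprocessInstance`): RETURNN asks the RASR subprocess for the automaton of a segment and unpickles the reply from a pipe; everything that actually builds the automaton runs inside RASR.

```python
# Illustrative only: mimic the request/reply pattern between RETURNN and the
# RASR subprocess with a dummy child that pickles a fake automaton to stdout.
import pickle
import subprocess
import sys

def get_automaton_for_segment(segment_name: str):
    # In RETURNN this is a long-lived RASR child process; here a throwaway
    # Python process stands in for it (segment_name is unused by the dummy).
    child_code = (
        "import pickle, sys; "
        "pickle.dump({'edges': [], 'weights': [], 'start_end_states': []}, sys.stdout.buffer)"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
    try:
        # If the child segfaults before replying, this is where the parent
        # would see "EOFError: Ran out of input".
        return pickle.Unpickler(child.stdout).load()
    finally:
        child.wait()

print(get_automaton_for_segment("example-corpus/recording-1/segment-1"))
```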
I just saw that I get the same segmentation fault also with the old tf2.8 image and RASR without TF. So maybe this is about some mismatch. With tf2.8 image and RASR compiled with that image, the example I created runs properly. |
You mean this here, right? rwth-i6/rasr#47 The RASR without TF is from Bene on branch |
My tf 2.8 image and the RASR compiled with that image work. All other combinations do not work, including the tf 2.14 image from rwth-i6/rasr#64 and the RASR compiled with that image. |
And this is the same RASR version as in the other cases? |
It seems like the RASR bug causing the seg fault (in `Speech::CTCTopologyGraphBuilder::addLoopTransition`) is fixed by rwth-i6/rasr#50. |
I created an apptainer image with tf 2.13 and tried to run a training with `FastBaumWelchLoss`. It crashes in step 0 because the `get_sprint_automata_for_batch` op is not found.
The actual error is this: