This repository contains the code and supplementary figures used in the paper "Subword representations successfully decode brain responses to morphologically complex written words" by Tero Hakala, Tiina Lindh-Knuutila, Annika Hultén, Minna Lehtonen, and Riitta Salmelin (Department of Neuroscience and Biomedical Engineering, Aalto University, Finland).
21.5.2024
This should work at least with the following versions:
- Python 3.10.13
- scikit-learn 1.3.2
- SciPy 1.11.3
Quickstart
Download the MEG data and the word vectors from OSF into their respective folders; the folder structure has been created for your convenience. The project data is stored at https://osf.io/2cbzw/ (DOI: 10.17605/OSF.IO/2CBZW).
To test whether everything works, you can try running the following command in the morppirepo/zeroshot directory.
python run_zero_shot_py3.py -d cosine --verbose --vocab ../vekt/trigrams/vocab.tsv --wordvec ../vekt/trigrams/vectors.tsv --morphemes ../vekt/trigrams/morppisanat_170_trigrams.txt --nosave 1 --output testrun_trigrams.txt ../megdata/megdata_filt40_check_noresamp_8.mat
This command should calculate the decoding accuracy for word vectors constructed from 3-gram segment vectors, using the MEG time window spanning 350-450 ms, and store the resulting accuracy in the file testrun_trigrams.txt.
This accuracy should be around 0.6424.
Contents by directory:
zeroshot
run_zero_shot_py3.py: This script decodes word vectors from evoked brain activation. It serves as a frontend to the ridge-regression function in the scikit-learn library.
The script takes as input the complete word vectors and MEG data vectors.
Alternatively, instead of complete word vectors, it can take a list of segmented words and the individual segment vectors. It then constructs the word vectors by summing the corresponding segment vectors for each word.
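For illustration, the summation might look like the following sketch. It assumes vocab.tsv lists one segment in the first column of each line, vectors.tsv holds the matching vector rows, and each line of the segmented-word list gives one word as space-separated segments; the script's actual parsing may differ.

```python
import numpy as np

# Assumed formats: vocab.tsv has the segment as the first column of each line;
# vectors.tsv has one whitespace-separated vector per line, in the same order.
segments = [line.split()[0] for line in open("vocab.tsv", encoding="utf-8")]
vectors = np.loadtxt("vectors.tsv")
seg2vec = dict(zip(segments, vectors))

# Assumed format: one word per line, written as its space-separated segments.
word_vectors = np.array([
    np.sum([seg2vec[s] for s in line.split()], axis=0)
    for line in open("morppisanat_170_trigrams.txt", encoding="utf-8")
])
```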
The script can perform permutation shuffling either at the word level (shuffling word labels) or at the segment level (shuffling segment labels before constructing the word vectors, as described in the paper).
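As a rough sketch of the underlying idea, not the script's actual code: leave-two-out zero-shot decoding with scikit-learn's Ridge and a cosine-distance 2-vs-2 test can be written as below. A segment-level permutation would instead shuffle the segment-to-vector mapping (e.g., permute the rows of the vectors array) before the word vectors are summed.

```python
import itertools
import numpy as np
from numpy.random import default_rng
from scipy.spatial.distance import cosine
from sklearn.linear_model import Ridge

def zero_shot_accuracy(meg, wordvecs, alpha=1.0, shuffle_words=False, seed=0):
    """Train ridge regression on all words but two, predict the two held-out
    word vectors, and count how often the correct pairing wins the 2-vs-2
    cosine-distance test. All names and defaults here are illustrative."""
    if shuffle_words:
        # Word-level permutation: break the word/vector correspondence.
        wordvecs = wordvecs[default_rng(seed).permutation(len(wordvecs))]
    hits, tests = 0, 0
    for i, j in itertools.combinations(range(len(wordvecs)), 2):
        train = [k for k in range(len(wordvecs)) if k not in (i, j)]
        model = Ridge(alpha=alpha).fit(meg[train], wordvecs[train])
        pred_i, pred_j = model.predict(meg[[i, j]])
        correct = cosine(pred_i, wordvecs[i]) + cosine(pred_j, wordvecs[j])
        swapped = cosine(pred_i, wordvecs[j]) + cosine(pred_j, wordvecs[i])
        hits += correct < swapped
        tests += 1
    return hits / tests
```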
megdata
The MEG data is hosted at OSF. Download it into this folder structure to run the experiments.
Evoked responses for each word in the experiment, averaged over subjects.
306 channels, sampled at 1000 Hz, low-pass filtered at 40 Hz.
Cleaned of artifacts (rejection limit 3000 fT/cm for gradiometers); ICA components of ocular artifacts were removed.
Trials with incorrect behavioral responses were removed before averaging over subjects.
MNE evoked files (averaged over subjects for each word): evokeds_averages.fif
Event trigger codes and corresponding stimulus words (only real words are used): triggercodes.csv
In the evoked data, the stimulus onset is at sample 238 from the beginning of the epoch (200 ms baseline + 38 ms delay, due to the projector refresh rate, until the stimulus is visible).
The data is split into 100 ms time windows (50 ms overlap); a sketch after the table shows how these windows map onto the evoked samples. MEG data vectors used in the decoding:
filename | time window (ms)
---|---
megdata_filt40_check_noresamp_1.mat | 0 - 100
megdata_filt40_check_noresamp_2.mat | 50 - 150
megdata_filt40_check_noresamp_3.mat | 100 - 200
megdata_filt40_check_noresamp_4.mat | 150 - 250
megdata_filt40_check_noresamp_5.mat | 200 - 300
megdata_filt40_check_noresamp_6.mat | 250 - 350
megdata_filt40_check_noresamp_7.mat | 300 - 400
megdata_filt40_check_noresamp_8.mat | 350 - 450
megdata_filt40_check_noresamp_9.mat | 400 - 500
megdata_filt40_check_noresamp_10.mat | 450 - 550
megdata_filt40_check_noresamp_11.mat | 500 - 600
megdata_filt40_check_noresamp_12.mat | 550 - 650
megdata_filt40_check_noresamp_13.mat | 600 - 700
megdata_filt40_check_noresamp_14.mat | 650 - 750
megdata_filt40_check_noresamp_15.mat | 700 - 800
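To make the window arithmetic concrete, here is an illustrative helper (not part of the repository) that maps a file index to its post-stimulus window and to sample indices in the evoked epochs, using the sample-238 onset described above:

```python
import mne

def window_ms(n):
    """Window of megdata_filt40_check_noresamp_<n>.mat: 100 ms starting at (n-1)*50 ms."""
    start = (n - 1) * 50
    return start, start + 100

def window_samples(n, onset_sample=238, sfreq=1000.0):
    """Sample range of window n in the evoked epoch (1000 Hz, onset at sample 238)."""
    start_ms, stop_ms = window_ms(n)
    return (onset_sample + int(start_ms * sfreq / 1000),
            onset_sample + int(stop_ms * sfreq / 1000))

# e.g., file 8 covers 350-450 ms, i.e., samples 588-688 of each evoked response:
evokeds = mne.read_evokeds("evokeds_averages.fif", verbose=False)
lo, hi = window_samples(8)
segment = evokeds[0].data[:, lo:hi]  # (306 channels, 100 samples)
```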
wordvectors
The segment vectors were constructed using the gensim word2vec skip-gram model on a Finnish internet corpus, in collaboration with the TurkuNLP group. (Luotolahti, J., Kanerva, J., Laippala, V., Pyysalo, S., & Ginter, F. (2015). Towards universal web parsebanks. Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015), 211–220.)
Before running word2vec, each word in the corpus was segmented into word segments according to the respective segmentation scheme. Surface corresponds to whole words (no segmentation).
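An illustrative sketch of such training with gensim's 4.x API (the corpus, vector size, and other hyperparameters are placeholders, not the values used in the paper):

```python
from gensim.models import Word2Vec

# Each training "sentence" is a list of tokens; after segmentation the tokens
# are word segments (e.g., 3-grams or Morfessor morphs) instead of whole words.
segmented_corpus = [
    ["tal", "alo", "lon"],  # placeholder: one segmented word
    # ... the real input would be the full segmented internet corpus
]

model = Word2Vec(
    sentences=segmented_corpus,
    sg=1,              # skip-gram
    window=7,          # context length; cf. the w1..w7 variants below
    vector_size=300,   # placeholder dimensionality
    min_count=1,
)
segment_vector = model.wv["alo"]  # vector for one segment
```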
The subdirectories contain word/segment vectors for various segmentations.
folder name | explanation
---|---
surface | Whole words
unigrams | 1-grams
bigrams | 2-grams
trigrams | 3-grams
morfessor | Segmentations by Morfessor, a statistical model of morphology
ling | Linguistic segmentation by the commercial Lingsoft utility
random | Random segmentation
morfessor_modified | Morfessor segmentation on a modified corpus (experiment words removed from the corpus)
ling_modified | Linguistic segmentation on a modified corpus (experiment words removed from the corpus)
The morfessor, ling, and surface directories are further divided into w1 to w7 subdirectories. These correspond to specific skip-gram window lengths that were used to construct the vectors (e.g., w7: 7 segments before to 7 segments after the target).
Each directory contains the following files:
- morppisanat_170_bigrams.txt: list of segmented words (in this case, segmented into 2-grams)
- vocab.tsv: list of individual segments (produced by word2vec)
- vectors.tsv: vectors for the segments in vocab.tsv (produced by word2vec)
- bigrams_170_w7_sum.mat: complete word vectors constructed by summing the corresponding segment vectors for each word (this file is no longer needed, as the zeroshot script can construct the word vectors from the segments itself; see the sketch after this list)
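The .mat files (this one and the MEG data vectors) can be inspected with SciPy; since the MATLAB variable names inside are not documented here, listing the keys first is the safe approach:

```python
from scipy.io import loadmat

mat = loadmat("bigrams_170_w7_sum.mat")
# Keys starting with "__" are MATLAB metadata; the rest are the data arrays.
print([k for k in mat if not k.startswith("__")])
```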
Experimental stimuli
The screen images of the words shown during the experiment. Hosted at OSF.
supplementary
Figures showing the hierarchical clustering (complete linkage, cosine distance) of the different word vectors, in PDF format.
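Such figures can be reproduced in outline with SciPy's hierarchical clustering (a minimal sketch, assuming word_vectors and the corresponding labels are loaded as in the earlier snippets):

```python
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Complete-linkage clustering with cosine distance over the word vectors.
Z = linkage(word_vectors, method="complete", metric="cosine")
dendrogram(Z, labels=labels, leaf_rotation=90)
plt.tight_layout()
plt.savefig("clustering.pdf")
```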