    A core aspect of human speech comprehension is the ability to incrementally integrate consecutive words into a structured and coherent interpretation, aligning with the speaker's intended meaning. This rapid process is subject to multidimensional probabilistic constraints, including both linguistic knowledge and non-linguistic information within specific contexts, and it is their interpretative coherence that drives successful comprehension. To study the neural substrates of this process, we extracted word-by-word measures of sentential structure from BERT, a deep language model, which effectively approximates the coherent outcomes of the dynamic interplay among various types of constraints. Using representational similarity analysis, we tested BERT parse depths and relevant corpus-based measures against the spatiotemporally resolved brain activity recorded by electro-/magnetoencephalography while participants listened to the same sentences. Our results provide a detailed picture of the neurobiological processes involved in the incremental construction of structured interpretations. These findings show when and where coherent interpretations emerge through the evaluation and integration of multifaceted constraints in the brain, which engages bilateral brain regions extending beyond the classical fronto-temporal language system. Furthermore, this study provides empirical evidence supporting the use of artificial neural networks as computational models for revealing the neural dynamics underpinning complex cognitive processes in the brain.
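
    The core idea of representational similarity analysis can be sketched as follows. This is a minimal illustration and not the study's pipeline: it assumes each item (for example, a word) has one model feature vector and one neural pattern vector, uses Euclidean distance for the dissimilarity matrices, and compares them with a Spearman correlation. The function names (`rdm`, `ranks`, `pearson`, `rsa_score`) are hypothetical.

```python
from itertools import combinations
from statistics import mean


def rdm(vectors):
    """Condensed dissimilarity matrix: Euclidean distance for every item pair."""
    return [
        sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
        for u, v in combinations(vectors, 2)
    ]


def ranks(values):
    """Average ranks (tied values share their mean rank), 1-based."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def rsa_score(model_vectors, neural_vectors):
    """Spearman correlation between the model RDM and the neural RDM."""
    return pearson(ranks(rdm(model_vectors)), ranks(rdm(neural_vectors)))
```

    In a spatiotemporally resolved analysis, a score of this kind would be computed separately for each sensor or source location and each time window, giving a map of when and where the model's dissimilarity structure matches the neural one.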






    A central topic in psycholinguistics is the study of how and when the parser assigns an antecedent to referentially-dependent elements. One such referentially-dependent element is the null subject of non-finite clauses (PRO). The aim of the present study was to examine the role of verb control information in the assignment of an antecedent to such a null subject. Results to date are inconclusive: some authors argue that verb control information has a late influence, whereas others argue that such verb-specific information has a very rapid influence. We report a self-paced reading study in Spanish in which verb type (subject vs. object control) and grammaticality (grammatical vs. ungrammatical) were manipulated. The grammaticality manipulation was carried out by introducing a person anomaly at the infinitive itself, and not at a later word (e.g., "Te prometí/aconsejé adelgazarme/adelgazarte cinco quilos en un mes." Literal translation: "I to you promised/advised to lose myself/yourself five kilos in a month"). With such a manipulation we can examine whether, at the first possible point (i.e., the infinitive), verb control information was used to assign the correct antecedent to PRO (i.e., the subject in sentences with a subject-control verb, and the object in sentences with an object-control verb). The results showed a main effect of grammaticality at the infinitive, meaning that the correct antecedent had already been assigned to PRO. The present findings are consistent with models that assume that verb-specific information plays an important role in the initial stages of sentence processing.






    In online language comprehension, the parser incrementally builds hierarchical syntactic structures. The predictive nature of this structure-building process has been the subject of extensive debate. A previous study observed that when a wh-phrase indicates parallelism between the upcoming wh-clause and a preceding clause (e.g., John told some stories, but we couldn't remember which stories…), the parser predictively constructs the wh-clause. This observation demonstrates predictive structure building. However, the study also suggests that the parser does not make a prediction when the wh-phrase indicates that parallelism does not hold (e.g., John told some stories … with which stories…), a potential limit to the prediction of syntactic structures. Crucially, these findings are controversial because the study did not observe processing difficulty (garden-path effects) when disambiguating input indicated that the predicted continuation was inconsistent with the globally grammatical structure. The controversial results may be due to a lack of statistical power. Therefore, the present study conducted a large-scale replication (324 participants and 24 sets of materials). The results revealed that the parser predicts the clausal structure, irrespective of the type of wh-phrase. There was also evidence of garden-path effects, supporting the finding that the parser makes a prediction. These observations suggest that the prediction algorithm inherent in the human parser is more powerful than assumed by the previous study and that the parser attempts to construct globally grammatical structures during revision.






    Proper identification of collocations, and more generally of multiword expressions (MWEs), is an important qualitative step for several NLP applications, particularly translation. Since many MWEs cannot be translated literally, failure to identify them yields at best an inaccurate translation. This paper is mostly concerned with collocations. We will show how they differ from other types of MWEs and how they can be successfully parsed and translated by means of a grammar-based parser and translator.






    PubMed is an invaluable resource for the biomedical community. Although PubMed is freely available, the existing API is not designed for large-scale analyses and the XML structure of the underlying data is inconvenient for complex queries. We developed an R package called pmparser to convert the data in PubMed to a relational database. Our implementation of the database, called PMDB, currently contains data on over 31 million PubMed Identifiers (PMIDs) and is updated regularly. Together, pmparser and PMDB can enable large-scale, reproducible, and transparent analyses of the biomedical literature. pmparser is licensed under GPL-2 and available at PMDB is available in both PostgreSQL (DOI 10.5281/zenodo.4008109) and Google BigQuery (
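
    To illustrate why a relational layout is convenient for the kind of complex query mentioned above, here is a minimal sketch. It does not use PMDB's actual schema: the tables `article` and `article_author`, their columns, and the toy rows are all hypothetical, and an in-memory SQLite database stands in for PostgreSQL or BigQuery.

```python
import sqlite3


def build_toy_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (pmid INTEGER PRIMARY KEY, pub_year INTEGER);
        CREATE TABLE article_author (pmid INTEGER, last_name TEXT);
        INSERT INTO article VALUES (1, 2019), (2, 2020), (3, 2020), (4, 2020);
        INSERT INTO article_author VALUES
            (1, 'Garcia'), (2, 'Garcia'), (3, 'Garcia'), (3, 'Lee'), (4, 'Lee');
        """
    )
    return conn


def articles_per_year(conn, last_name):
    """Count articles per publication year for one author's last name."""
    return conn.execute(
        """
        SELECT a.pub_year, COUNT(*)
        FROM article AS a
        JOIN article_author AS au ON au.pmid = a.pmid
        WHERE au.last_name = ?
        GROUP BY a.pub_year
        ORDER BY a.pub_year
        """,
        (last_name,),
    ).fetchall()
```

    A join plus aggregation like this is a single statement against relational tables, whereas the same question asked of nested XML records would require parsing and traversing every record.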






    BACKGROUND: Syntactic analysis, or parsing, is a key task in natural language processing and a required component for many text mining approaches. In recent years, Universal Dependencies (UD) has emerged as the leading formalism for dependency parsing. While a number of recent tasks centering on UD have substantially advanced the state of the art in multilingual parsing, there has been little study of parsing texts from specialized domains such as biomedicine.
    METHODS: We explore the application of state-of-the-art neural dependency parsing methods to biomedical text using the recently introduced CRAFT-SA shared task dataset. The CRAFT-SA task broadly follows the UD representation and recent UD task conventions, allowing us to fine-tune the UD-compatible Turku Neural Parser and UDify neural parsers to the task. We further evaluate the effect of transfer learning using a broad selection of BERT models, including several models pre-trained specifically for biomedical text processing.
    RESULTS: We find that recently introduced neural parsing technology is capable of generating highly accurate analyses of biomedical text, substantially improving on the best performance reported in the original CRAFT-SA shared task. We also find that initialization using a deep transfer learning model pre-trained on in-domain texts is key to maximizing the performance of the parsing methods.







    For a genome-wide association study in humans, genotype imputation is an essential analysis tool for improving association mapping power. When IMPUTE software is used for imputation analysis, an imputation output (GEN format) should be converted to variant call format (VCF) with imputed genotype dosage for association analysis. However, the conversion requires multiple software packages in a pipeline with a large amount of processing time.
    We developed GEN2VCF, a fast and convenient GEN format to VCF conversion tool with dosage support.
    The performance of GEN2VCF was compared to BCFtools, QCTOOL, and Oncofunco. The test data set was a 1 Mb GEN-formatted file of 5000 samples. To assess performance across sample sizes, tests were run on 1000 to 5000 samples in steps of 1000. Runtime and memory usage were used as performance measures.
    GEN2VCF showed markedly better performance in both runtime and memory usage: its runtime and memory usage were at least 1.4-fold and 7.4-fold lower, respectively, than those of the other methods.
    GEN2VCF provides users with efficient conversion from GEN format to VCF with the best-guessed genotype, genotype posterior probabilities, and genotype dosage, as well as great flexibility in implementation with other software packages in a pipeline.
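
    The conversion described above can be sketched for a single record. This is an illustrative sketch only, not GEN2VCF's implementation. It assumes a GEN line with five leading columns (SNP ID, rsID, position, allele A, allele B) followed by one probability triple (AA, AB, BB) per sample, treats allele A as REF and allele B as ALT, and computes dosage as the expected count of allele B. The chromosome is passed in separately, and the function name `gen_line_to_vcf` is hypothetical.

```python
def gen_line_to_vcf(line, chrom):
    """Convert one GEN record to a VCF data line with GT, DS and GP per sample."""
    fields = line.split()
    snp_id, rsid, pos, allele_a, allele_b = fields[:5]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("expected three genotype probabilities per sample")

    genotypes = ("0/0", "0/1", "1/1")  # AA, AB, BB with A as REF (assumption)
    samples = []
    for i in range(0, len(probs), 3):
        triple = probs[i:i + 3]
        p_aa, p_ab, p_bb = triple
        # Dosage: expected number of copies of allele B.
        dosage = p_ab + 2 * p_bb
        # Best-guess genotype: the most probable of the three (no threshold here).
        best = max(range(3), key=lambda g: triple[g])
        samples.append(
            f"{genotypes[best]}:{dosage:.3f}:{p_aa:.3f},{p_ab:.3f},{p_bb:.3f}"
        )

    fixed = [chrom, pos, rsid, allele_a, allele_b, ".", ".", ".", "GT:DS:GP"]
    return "\t".join(fixed + samples)
```

    A complete converter would also write the VCF header lines and would typically apply a probability threshold before calling a best-guess genotype; both are omitted here for brevity.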







    Brain activity in numerous perisylvian brain regions is modulated by the expectedness of linguistic stimuli. We leverage recent advances in computational parsing models to test what representations guide the processes reflected by this activity. Recurrent Neural Network Grammars (RNNGs) are generative models of (tree, string) pairs that use neural networks to drive derivational choices. Parsing with them yields a variety of incremental complexity metrics that we evaluate against a publicly available fMRI dataset recorded while participants simply listen to an audiobook story. Surprisal, which captures a word's unexpectedness, correlates with a wide range of temporal and frontal regions when it is calculated based on word-sequence information using a top-performing LSTM neural network language model. The explicit encoding of hierarchy afforded by the RNNG additionally captures activity in left posterior temporal areas. A separate metric tracking the number of derivational steps taken between words correlates with activity in the left temporal lobe and inferior frontal gyrus. This pattern of results narrows down the kinds of linguistic representations at play during predictive processing across the brain's language network.
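
    Surprisal is the negative log probability of a word given its context, so less expected words score higher. The sketch below illustrates the quantity with a deliberately simple stand-in: an add-one-smoothed bigram model replaces the LSTM and RNNG models used in the study, and the function names are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect the counts needed for an add-one-smoothed bigram model."""
    context_counts = Counter(tokens[:-1])          # how often each word is a context
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return context_counts, bigram_counts, vocab


def surprisal(prev, word, model):
    """Surprisal in bits: -log2 P(word | prev) under the smoothed bigram model."""
    context_counts, bigram_counts, vocab = model
    p = (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return -math.log2(p)
```

    Applied word by word to a story transcript, a function like this yields one value per word, which is the kind of incremental predictor that can then be related to recorded brain activity.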






    To understand human language, whether spoken or signed, the listener or viewer has to parse the continuous external signal into components. The question of what those components are (e.g., phrases, words, sounds, phonemes) has been a subject of long-standing debate. We reframe this question to ask: what properties of the incoming visual or auditory signal are indispensable to eliciting language comprehension? In this review, we assess the phenomenon of language parsing from a modality-independent viewpoint. We show that the interplay between dynamic changes in the entropy of the signal and neural entrainment to the signal at the syllable level (4-5 Hz range) is causally related to language comprehension in both speech and sign language. This modality-independent Entropy Syllable Parsing model for the linguistic signal offers insight into the mechanisms of language processing, suggesting common neurocomputational bases for syllables in speech and sign language. This article is categorized under: Linguistics > Linguistic Theory; Linguistics > Language in Mind and Brain; Linguistics > Computational Models of Language; Psychology > Language.






    Syntactic and semantic information processing can interact selectively during language comprehension. However, the nature and extent of the interactions, in particular of semantic effects on syntax, remain somewhat elusive. We revisit an influential ERP result by Kim and Osterhout (2005), later replicated by Kim and Sikos (2011), that the verb in sentences such as 'The hearty meal was devouring …' evokes a P600 effect, a signature of syntactic processing difficulty, even though all stimuli were grammatically well-formed. We view this effect as a manifestation of a conflict in the assignment of grammatical subject and object roles to the verb's arguments as performed independently by a semantic system (predicting that meal should be the object) and by a syntactic system (labeling meal as the subject). More specifically, we develop an explicit algorithmic implementation of a parallel processing architecture that supports (i) meaning-based prediction of grammatical role labels, using either a probabilistic label guesser or a neural network, and (ii) comparison of the predicted labels with labels assigned by a state-of-the-art dependency parser. We demonstrate that the system can classify sentences from the Kim and Osterhout (2005) corpus with adequate accuracy, and can detect labeling conflicts as intended. Some implications of our results for models of prediction in language processing are discussed.





