Funded PhD position - ASR in pathological speech
An exciting and ambitious PhD position at Imperial College London is open to suitable candidates interested in developing state-of-the-art automatic speech recognition (ASR) for pathological speech. The PhD is fully funded for UK home students. International applicants may apply either fully self-funded or partially self-funded through industry sponsorship combined with Imperial College funding. The PhD project is part of the UKRI Centre for Doctoral Training in Artificial Intelligence for Healthcare at Imperial College London (see https://ai4health.io/apply/ ). When you apply, please specify your interest in the ACOUSTICS study, Project ID 319: Automated community assessment of aphasic stroke.
Healthcare supervisor and main contact:
Fatemeh Geranmayeh, [email protected]
AI Supervisor:
Patrick Naylor, [email protected]
Project Motivation
Aphasia (language impairment) is common after stroke. Developing ASR models that detect a variety of clinically pertinent speech errors would be a game-changing advance in monitoring and diagnosing patients in the community. The overarching aim of this project is to advance ASR in patients with post-stroke aphasia in order to a) monitor patients' speech in their home setting after stroke and b) identify clinically useful speech errors with the potential to be used as targets for speech and language therapy.
Project proposal
Work stages are:
1) Application of pre-defined speech labels and speech segmentation in a large database of meticulously transcribed aphasic speech segments from English-speaking patients
2) Model design and representation learning
a. Selection of the relevant acoustic features
b. Transcription of speech using various pre-trained automatic speech recognition (ASR) models at different linguistic levels: phoneme, word, and sentence (see the transcription sketch after this list)
c. Designing novel speech recognition algorithms that account for multimodal and multi-task constraints
3) Training and Fine-tuning
4) Model Evaluation and Error Estimation
a. Validating ASR performance on different speech tasks and against clinical scales
b. Designing a new error estimation algorithm to examine the effect of stroke lesion characteristics on ASR accuracy
c. Using ASR-derived speech metrics to predict aphasia clinical severity
5) Real-time deployment using online speech tools already in place (IC3)
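As a flavour of stage 2b above, the sketch below shows word-level transcription of one segmented utterance with a pre-trained ASR model. It assumes the Hugging Face transformers library and the openai/whisper-small checkpoint; the audio file name is an illustrative placeholder, and phoneme-level models could be substituted in the same way.

```python
# Minimal sketch of stage 2b: word-level transcription with a pre-trained ASR model.
# Assumes the Hugging Face "transformers" library and the openai/whisper-small
# checkpoint; the audio file name is an illustrative placeholder.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe one segmented utterance from the aphasic speech database.
result = asr("patient_utterance.wav")
print(result["text"])
```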
The available training data will contain audio and/or audiovisual speech recordings and detailed transcripts for training multimodal ASR models. These models will incorporate acoustic features, visual features, phonological transcripts (in the International Phonetic Alphabet), transcripts containing linguistic error coding (phonological, semantic, neologism), paralinguistic coding (speech quality, emotion, accent), and dysfluency coding (speech sound distortions, elongation, repetition, dysfluencies and pauses). From these, the models will derive measures of fluency, lexical and semantic content, syntactic completeness, linguistic complexity and errors.
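As an illustration of how these annotation layers might sit alongside each other during model development, the sketch below defines a simple per-utterance record. All field names, codes and the example contents are assumptions made for illustration, not the project's actual annotation scheme.

```python
# Minimal sketch of one annotated utterance for multimodal training.
# Field names and codes are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AnnotatedUtterance:
    audio_path: str                  # segmented audio clip
    video_path: Optional[str]        # optional audiovisual recording
    orthographic: str                # plain transcript
    ipa: str                         # phonological transcript (International Phonetic Alphabet)
    error_codes: List[str] = field(default_factory=list)     # e.g. "phonological", "semantic", "neologism"
    paralinguistic: List[str] = field(default_factory=list)  # e.g. speech quality, emotion, accent
    dysfluencies: List[str] = field(default_factory=list)    # e.g. repetition, elongation, pause

# Example record (contents invented for illustration).
example = AnnotatedUtterance(
    audio_path="session01_utt03.wav",
    video_path=None,
    orthographic="the man is walking",
    ipa="ðə mæn ɪz wɔːkɪŋ",
    error_codes=["semantic"],
    dysfluencies=["pause"],
)
```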
Speech data will be segmented and denoised, and acoustic features (e.g. Mel-frequency cepstral coefficients, MFCCs) will be extracted from the speech spectrograms. Aligners (HTK, Kaldi, or developed de novo) will be tuned on the training data to determine the start and end of each phoneme/word. These features will be fed into recurrent neural networks, including deep bidirectional long short-term memory networks (BLSTM-RNNs).
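A minimal sketch of this feature extraction and acoustic modelling step is given below, assuming librosa for MFCC extraction and PyTorch for the BLSTM; the file name, frame settings, layer sizes and number of phoneme classes are illustrative placeholders rather than project choices.

```python
# Minimal sketch: MFCC extraction followed by a deep BLSTM acoustic model.
# Assumes librosa and PyTorch; all file names and dimensions are placeholders.
import librosa
import torch
import torch.nn as nn

# Load a segmented, denoised utterance and extract MFCCs (coefficients x frames).
waveform, sr = librosa.load("patient_utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13, hop_length=160)   # (13, T)
features = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)           # (1, T, 13)

class BLSTMAcousticModel(nn.Module):
    """Deep bidirectional LSTM mapping MFCC frames to per-frame phoneme scores."""
    def __init__(self, n_mfcc=13, hidden=128, n_phonemes=40, layers=3):
        super().__init__()
        self.blstm = nn.LSTM(n_mfcc, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, x):
        out, _ = self.blstm(x)
        return self.classifier(out)      # (batch, T, n_phonemes)

model = BLSTMAcousticModel()
frame_scores = model(features)           # per-frame phoneme scores for alignment/decoding
```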
Supervised and unsupervised learning models will be used. Recognition models will operate at different levels: phoneme level (Hidden Markov Model/Gaussian Mixture Model), lexical models to detect words, and language models to detect utterances. Error estimation will be performed at each level. Keyword spotting will be used to detect expected target words.
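The sketch below illustrates word-level error estimation and a simple form of keyword spotting against expected target words. It assumes the jiwer package for word error rate; the reference and hypothesis strings are invented examples, not project data.

```python
# Minimal sketch: word-level error estimation and keyword spotting.
# Assumes the "jiwer" package; strings below are invented examples.
import jiwer

reference  = "the man is walking the dog"     # clinician transcript
hypothesis = "the man is warking dog"         # ASR output

# Word error rate as one error-estimation metric at the lexical level.
print("WER:", jiwer.wer(reference, hypothesis))

# Keyword spotting: check whether expected target words appear in the ASR output.
target_words = {"man", "walking", "dog"}
recognised = set(hypothesis.split())
print("targets detected:", target_words & recognised)
print("targets missed:  ", target_words - recognised)
```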
Main AI challenges or advances expected
Comprehensive speech assessment of patients using automatic speech recognition has lagged behind that for healthy speech. This is due to high variability in speech errors, both between aphasic individuals and within individuals over time, as well as a lack of good training datasets. Furthermore, ASR in patients is also challenging from an AI point of view, given a high-dimensional output space and a complex sequence-to-sequence problem. Modelling multimodal/multidimensional speech audio, video, linguistic errors, dysfluencies, paralinguistic features, and stroke lesions in parallel, in order to assess speech remotely, will be challenging.
Main AI approaches in keywords
Representation Learning, Deep Neural Networks, Automatic Speech Recognition, Diagnosis/Classification, Acoustic models
Expected impact that the project will have within 3.5 years
This work will advance the state of the art in aphasic speech processing, enhance AI-based diagnosis and monitoring of patients with speech and language disorders in the community, and provide a foundation for developing remote automated therapy apps for speech production, in which an AI therapist can monitor and guide rehabilitation in the patient's own home. The advances will be applicable not only to patients with post-stroke aphasia, but can also be used to detect early clinical progression in patients with dementias, brain tumours, and motor neuron disease.
Student background for this project
The prospective student should have strong programming and computing skills from a related background such as mathematics, physics, or engineering. Some linguistic knowledge would be highly desirable. Successful applicants normally have a First Class four-year undergraduate degree and/or a Distinction-level Master's degree.