Tuesday, August 20, 2019
Factor Language Model Programming
Factor Language Model Programming Language Model Language model helps a speech recognizer figure out how likely a word sequence is, independent of acoustics. There is a linguistic and statistical approach to calculate the probability. The linguistic technique tries to understand the syntactic and semantic structure of a language and derive the probabilities of word sequences using this knowledge. The challenge here is to have proper co occurrence statistics of the unit of recognition. The approach in use evaluates a huge text corpus in a statistical way and word transitions. Current language models make no use of the syntactic properties of natural language but rather use very simple statistics such as word co-occurrences. Recent results show that incorporating syntactic constraints in a statistical language model reduces the word error rate on a conventional dictation task by 10% [M.S.Salam, 2009]. Proposed Language Model The approach proposed here uses factored language model which incorporates the morphological knowledge. Factored language models have recently been proposed for incorporating morphological knowledge in the modeling lexicon. As suffix and compound words are the cause of the growth of the vocabulary, a logical idea is to split the words into shorter units. The language model proposed in this research is based on morphology. A morphological analyser obtains and verifies the internal structure of a given complete word form [Rosenfield, 2000]. Building a morphological analyser for highly inflecting, agglutinative languages is a challenging task. It is very difficult to build a high performance analyser for such languages. The main idea here is to divide a given word form into a stem and single suffix. Morphology plays a much greater role in Telugu. An inflected Telugu word starts with a stem and may have suffix(s) added to the right according to complex rules of saMdhi. This research proposes a new data structure based on Inverted Index and an efficient algorithm for accessing its elements. Few researchers have used tries for efficient retrieval from dictionary, earlier. This research work is different from earlier work in two ways: a) variation to the structure of trie b) the method of identifying and combining inflections. Modified Trie Structure A trie is a tree based data structure for storing strings in order to support fast pattern matching. A trie T represents the strings of set S of n strings with paths from root to the external node of T. Fig 5.1: Original Trie Structure The trie considered here is different from standard trie in two ways: 1) A standard trie does not allow a word to be prefix of another, but the proposed trie structure allows a word to be prefix of another word. The node structure and search algorithm also is given according to this new property. 2) Each word in a standard trie ends at an external node, where as in the modified trie a word may end at either an external node, or the internal node. Irrespective of whether the word ends at internal node or external node, the node stores the index of the associated word in the occurrence list. The node structure is changed such that, each node of the trie is represented by a triplet C,R,Ind>. C represents character stored at that node. R represents whether the concatenation of characters from root till that node forms a meaningful stem word. Its value is 1, if characters from root node to that node form a stem, 0 otherwise. Ind represents index of the occurrence list. Its value depends on the value of R. Its value is -1 (negative 1), if R=0, indicating it is not a valid stem. So no index of occurrence list matches with it. If R=1, its value is index of occurrence list of associated stem. Fig 5.2: Modified Trie Structure Advantages relative to binary search tree: The following are the main advantages of tries overbinary search trees(BSTs): Looking up keys is faster. Looking up a key of lengthmtakes worst caseO(m) time. A BST performs O(log(n)) comparisons of keys, wherenis the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(mlogn) time. Moreover, in the worst case log(n) will approachm. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines. Tries can require less space when they contain a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys with common initial subsequences. Tries facilitatinglongest-prefix matching, helping to find the key sharing the longest possible prefix of characters all unique. Corpus structure of proposed Language Model The corpus consists of the following modules: Stem word dictionary This accommodates all the stems of the language. Stem word dictionary is implemented as an Inverted Index for better efficiency. The Inverted index will have the following two data structures in it: 1) Occurrence list: It is an array of pairs, 2) Stem trie: consisting of stem words Occurrence list is constructed based on the grammar of the language, where each entry of the list contains the pair (ii) Inflection Dictionary This dictionary contains the list of all possible inflections of the Telugu language. Each entry of Stem word dictionary lists the indexes of this dictionary to indicate which all inflections are possible with that stem. The proposed corpus structure helps in reducing the corpus size drastically. Every stem word may have number of inflections possible. If the inflected words are stored as it is, then corpus size would be m*n, where m is number of stem words and n is number of inflections. Instead of storing all the inflected words, the proposed corpus structure stores stem words and inflections separately, and handles the inflected words through morphology. Hence the corpus size required is for m stem words and n inflections i.e., m+n. Thus there is a great reduction in the corpus size. For a corpus of 1000 stem words and 10 inflections, the required corpus size is 1000+10=1010, which otherwise would have required 1000*10=10000. Fig 5.3 : Corpus structure of proposed Language Model Textual Word Segmentation using Proposed Language Model The proposed language model is used to develop a textual word segmenter. A word segmenter is used to divide the given inflected word into a stem and single inflection. This is required as the corpus stores stems and inflections separately. Input the word segmenter is an Inflected word. Syllabifier takes this word and divides the word into syllables and identifies if the letter is a vowel or a consonant. After applying the rules syllabified form of the input will be obtained. Once the process of syllabification is done, this will be taken up by the analyzer. Analyzer separates the stem and inflection part of the given word. This stem word will be validated by comparing it with the stem words present in stem dictionary. If the stem word is present, then the inflection of the input word will be compared with the inflections present in inflection dictionary of the given stem word. If both the inflections get matched then it will directly displays the output otherwise it takes the appropriate inflection(s) through comparison and then displays. Syllabification is the separation of the words into syllables, where syllables are considered as phonological building blocks of words. It is dividing the word in the way of our pronunciation. The separation is marked by hyphen. In the morphological analyzer, the main objective is to divide the given word into root word and the inflection. For this, we divide the given input word into syllables and we compare the syllables with the root words and inflections to get the root word and appropriate inflection. Fig 5.4: Block diagram of Word Segmentr for text Steps for word segmentation Receiving the inflected word as an input from the user. Syllabify the input Analyze the input and validating the stem word. Identify the appropriate inflection for the given stem word by comparing the inflection of given word with the inflections present in inflection dictionary of the stem word. Displaying the appropriate inflected word. For example, considering the word ââ¬Å"nAnnagarikiâ⬠(à à °Ã ¨Ã à °Ã ¾Ã à °Ã ¨Ã à ±Ã à à °Ã ¨Ã à °-à à °Ã ¾Ã à °Ã °Ã à °Ã ¿Ã à °Ã¢â¬ ¢Ã à °Ã ¿) meaning ââ¬Å"to fatherâ⬠, the input is given the user in Roman transliteration format. This input is basically divided into lexemes as: Now, the array is processed which gives the type of lexeme by applying the rules of syllabification one by one. Applying Rule 1: ââ¬Å" No two vowels come together in Telugu literature.â⬠The given user input does not have two vowels together. Hence this rule is satisfied by the given user input. The output after applying this rule is same as above. If the rule is not satisfied, an error message is displayed that the given input is incorrect. Now the array is: c ââ¬â v ââ¬â c ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v Applying Rule 2: ââ¬Å" Initial and final consonants in a word go with the first and last vowel respectively.â⬠Telugu literature rarely has the words which end up with a consonant. Mostly all the Telugu words end with a vowel. So this rule does not mean the consonant that ends up with the string, but it means the last consonant in string. The application of this rule2 changes the array as following: c ââ¬â v ââ¬â c ââ¬â cââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v cv ââ¬â c ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â cv This generated output is further processed by applying the other rules. Applying Rule 3: ââ¬Å" VCV: The C goes with the right vowel.â⬠The string wherever has the form of VCV, then this rule is applied by dividing it as V ââ¬â CV. In the above rule the consonant is combined with the vowel, but here in this rule the consonant is combined with the right vowel and separated from the left vowel. To the output generated by the application of rule2, this rule is applied and the output will be as: cv ââ¬â c ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â c ââ¬â v ââ¬â cv cv ââ¬â c ââ¬â c ââ¬â v ââ¬â cv ââ¬â cv ââ¬â cv This output is not yet completely syllabified, one more rule is to be applied which finishes the syllabification of the given user input word. Applying Rule 4: ââ¬Å" Two or more Cs between Vs First C goes to the left and the rest to right.â⬠It is the string which is in the form of VCCC*V, then according to this rule it is split as VC ââ¬â CC*V. In the above output VCCV in the string can be syllabified as VC ââ¬â CV. Then the output becomes: cv ââ¬â c ââ¬â c ââ¬â v ââ¬â cv ââ¬â cv ââ¬â cv cvcââ¬â cv ââ¬â cv ââ¬â cv ââ¬â cv Now this output is converted to the respective consonants and vowels. Thus giving the complete syllabified form of the given user input. nAn ââ¬â na ââ¬âcA ââ¬â ri ââ¬â ku cvc ââ¬â cv ââ¬â cv ââ¬â cv ââ¬â cv Hence, for the given user input, ââ¬Å"nAnnagArikiâ⬠, the generated syllabified form is, ââ¬Å"nAn ââ¬â na ââ¬â gA ââ¬â ri ââ¬â kiâ⬠. Fig 5.5: Word Segmenter showing an inflected word without change in stem form Fig 5.6: Word Segmenter showing an inflected word with a change in stem form SCIL Speech Corrector for Indian Languages In inflectional language every word consists of one or several morphemes into which the word can be segmented. The approach used here aims at reducing the above mentioned problem of having a very huge corpus for good recognition accuracy. It exploits the characteristic of Telugu language that every word consists of one or several morphemes into which the word can be segmented. SCIL is a procedure To deal with complex word forms applied after recognition Using which misrecognized words are corrected Architecture of SCIL The design of Speech Corrector for Indian Languages, consists of the Syllable Identifier, Phone Sequence Generator, Word Segmenter, and Morpho- Syntactic Analyzer modules. Input speech is decoded by a normal ASR system which gives the identified word as a string. The sequence of phones would be the input to the Word Segmenter module which matches the phonetized input with the root words stored in dictionary module, and generates a possible set of root words. Morpho-Syntactic Analyzer compares the inflection part of the signal with the possible inflections list from the database and gives correct inflection. This will be given to Morph Analyzer to apply morpho-syntactic rules of the language and gives the correct inflected word. Fig 5.7: Block diagram of SCIL i) Syllable Identifier Syllable identifier marks the rough boundaries of the syllables and labels them. At this stage , we get list of syllables separated with hyphen. The user input is syllabified and this would be the input to the next module. E.g. dE-vA-la-yA-ku ii) Phone Sequence Generator As the words in the dictionary are stored at phone level transcription, this module generates the phone sequences from the syllables. E.g. d-E-v-A-l-a-y-A-k-u iii) Word Segmentor This module compares the phonetized input from starting with the root words stored in dictionary module and lists the possible set of root words. The possible root word is dEvAlayamu. iv) Dictionary Dictionary contains stems and inflections separately. It does not store inflected words as it is very difficult, if not impossible, to cover all inflected words of the language. The database consists of 2 dictionaries: Stem Dictionary Inflection Dictionary Stem dictionary contains the stem words of the language, signal information for that stem which includes the duration and location of that utterance and list of indices of inflection dictionary which are possible with that stem word. Inflection Dictionary contains the inflections of the language, signal information for that inflection which includes the duration and location of that utterance. Both the dictionaries are implemented using trie structure in order to reduce the search space. v) Morpho Syntactic Analyzer This module compares the inflection part of the signal with the possible inflections list from the database and gives correct inflection. This will be given to Morph Analyzer to apply morpho-syntactic rules of the language and gives the correct inflected word. Post Recognition Procedure Capture the utterance, an isolated inflected word. Get its syllabified form. Generate phone sequence from the syllabified word. Compare the phone sequences with stem words in the dictionary and identify the stem. Segment the word into stem and inflection. Get the list of possible inflections. Compare the inflection signals possible with that stem one by one and apply morpho-syntactic rules of the language to combine stem and inflection. Display the inflected word. Using the rules the possible set of root words are combined with possible set of inflections and the obtained results are compared with the given user input and the nearest possible root word and inflection are displayed if the given input is correct. If the given input is not correct then the inflection part of the given input word is compared with the inflections of that particular root word and identifies the nearest possible inflection and combines the root word with those identified inflections, applies sandhi rules and displays the output. When there is more than one root word or more than one inflection has minimum edit distance then the model will display all the possible options. User can choose the correct one from that. For example, when the given word is pustakaMdO (à à °Ã ªÃ à ±Ã à à °Ã ¸Ã à ±Ã à à °Ã ¤Ã à °Ã¢â¬ ¢Ã à °Ã¢â¬Å¡Ã à °Ã ¦Ã à ±Ã¢â¬ ¹), the inflections tO making it pustakaMtO (à à °Ã ªÃ à ±Ã à à °Ã ¸Ã à ±Ã à à °Ã ¤Ã à °Ã¢â¬ ¢Ã à °Ã¢â¬Å¡Ã à °Ã ¤Ã à ±Ã¢â¬ ¹) meaning ââ¬Ëwith the bookââ¬â¢ and lO making it pustakaMlO (à à °Ã ªÃ à ±Ã à à °Ã ¸Ã à ±Ã à à °Ã ¤Ã à °Ã¢â¬ ¢Ã à °Ã¢â¬Å¡Ã à °Ã ²Ã à ±Ã¢â¬ ¹) meaning ââ¬Ëin the bookââ¬â¢) mis are possible. Present work will list both the words and user is given the option. We are working on improving this by selecting the appropriate word based on the context. SCIL Algorithm W=Utterance.wav Syl[]=SyllableIdentifier(W) Phone[]=phonetizer(Syl[]) Stem=getStem(Syl[]) Infl[]=getInflections(Stem) While (not exactMatch) word=MorphAnalyzer(stem,inflMatch) display word Stop Working of SCIL Once possible root words identified the given word is segmented into two parts, first being the root word and second part inflection. Now the inflection part is compared in the reverse direction for a match in the inflection dictionary. It will consider only the inflections that are mentioned against the possible root words, thus reducing the search space and making the algorithm faster. For example consider ââ¬Å"nAnnagarikiâ⬠(à à °Ã ¨Ã à °Ã ¾Ã à °Ã ¨Ã à ±Ã à à °Ã ¨Ã à °-à à °Ã ¾Ã à °Ã °Ã à °Ã ¿Ã à °Ã¢â¬ ¢Ã à °Ã ¿) meaning ââ¬Å"to fatherâ⬠, is misrecognized as nAn-na-cA-ri-ku (à à °Ã ¨Ã à °Ã ¾Ã à °Ã ¨Ã à ±Ã à à °Ã ¨Ã à °Ã
¡Ã à °Ã ¾Ã à °Ã °Ã à °Ã ¿Ã à °Ã¢â¬ ¢Ã à ±Ã ) then SCIL is applied and will correct the recognition error as follows: The output from ASR is nAn-na-cA-ri-ku. The phone sequence generator will generate the phone sequence as n-A-n-n-a-c-A-r-i-k-u. Now, match it with the set of root words stored in dictionary module. This process will identify the possible set of root words from the Stem dictionary as follows: Once possible root words identified the given word is segmented into two parts, first being the root word and second part inflection. Now the inflection part is compared for a match in the inflection dictionary. It will consider only the inflections that are mentioned against the possible root words, thus reducing the search space and making the algorithm faster. Possible set of inflections in inflections dictionary After getting the possible set of root words and possible set of inflections they are combined with the help of SaMdhi formation rules. Here in this example cA-ri-ku is compared with the inflections of the root word nAnna After comparing it identifies gAriki as the nearest possible inflection and combines the root word with the inflection and displays the output as ââ¬Å"nAnnagArikiâ⬠. Conclusions Language model proposed in this work results in reduction in corpus size by using factored approach. The search process is fastened by use of trie based structure. A change to standard trie is proposed. A post recognition procedure SCIL, is designed which uses the proposed language model and corrects the words misrecognized at inflections. The approach is tested using 1500 speech samples. These samples consist of 100 distinct words , each word repeated 3 times and recorded by 5 speakers in the age group 18-50. It is implemented as a speaker dependent system. An average model is built from the three utterances of each word for each speaker. Each speaker is given a unique ID, using which average model of that speaker is used for testing.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment
Note: Only a member of this blog may post a comment.