Chris Pollett >
Students > [Bio] [Blog] CS 297 Deliverables CS 298

CS297 Deliverable 2DocumentLevel Machine Translation with Hierarchical AttentionExperiments with statistical machine translationYuTang Shen (yutang.shen@sjsu.edu) Advisor: Dr. Chris Pollett [PDF version of this page] [Implementation code]Introduction on Statistical Machine TranslationWhile rulebased machine translation (RBMT) was thriving from 1930s to 1950s, Warren Weaver proposed an idea of utilizing statistics in languages to produce translations [1]. Although RBMT could provide satisfying translation, it required complex rule definition, preediting, and postediting. In contrast, statistical machine translation (SMT) proposed to simply look at the relationship between the statistics of the source language (SL) and the probability of all the possible candidates in target language (TL). With SMT, it was not necessary for programmers to define every single rule between SL and TL, instead, the task became gathering abundant amount of data to obtain an unbiased statistic for every word. To mathematically describe SMT task, the job can be described as find the best candidate \( \hat t \) for the source language vocabulary \( s\) from the TL distribution \( T\). $$\hat t = argmax_{t \in T} P(ts)$$ With more complicated model, which will be covered in section II, parameters \(p\) will be added to the equation to improve the translation quality. $$\hat t = argmax_{t \in T} P(ts; p)$$ Statistical Machine Translation TypesThere are various designs in SMT, including implementation from Charniak [2], IBM [3], etc. Greedy algorithmGreedy algorithm, as the name suggests, chooses the best candidate based on its perspective and disregards all other information. A simple greedy algorithm can be described as follows: 1 translate(S) 2 result = NULL 3 for s in S 4 candidates = getCandidate(s) 5 bestCandidate, bestScore = NULL, 1 6 for candidate in candidates 7 if(probability(s, candidate) > bestScore) 8 bestCandidate = candidate 9 bestScore = probability(s, candidate) 10 end if 11 end for 12 result = result + bestCandidate 13 end for 14 return result
In There are several disadvantages in this method, such as overlooking information in adjacent words, TL ordering, and TL usages. This method merely considers the probability of given SL word translated into some TL word, which completely ignores words' neighbors, which two or more of them can become a phrase. Another common issue in machine translation is the output ordering, for example, TL grammar dictates that subjects should be in front of verbs, while verbs go before subjects in SL grammar. The greedy method preserves the original ordering SL and can fail to reorder the words to produce a grammatically correct TL sentence. Still, this method provides a more concise way to build a simple machine translator, where rules need not to be specified. Charniak's modelIn Charniak's SMT model, more information is considered when choosing the best candidate for a translation. In Charniak's model, not only the relationship between SL and TL is considered but also the usage of the TL [2]. Charniak developed an English parser that describes the probability of specific word being which part of speech [4]. For example, given a word "said", the probability of it being followed by two consecutive nouns is low, while for the word "gave" the probability of it being followed by two consecutive nouns is high. This parser describes the usage of a language; in other words, it is easy to lookup if a word is used correctly by its probability. With this parser that describes a language, Charniak utilizes it to optimize the selection process during SMT. As described in the first section, SMT selects best candidate by \(\hat t = argmax_{t \in T} P(ts)\), and Charniak expanded this formula to \(\hat t = argmax_{t \in T} P(st)P(t)\). Since the distribution of \(P(s)\) is fixed once the input is given, two equations are equivalent. Recall that the parser is a function that gives the probability of a given word being which part of speech, the parser serves as \(P(t)\)in Charniak's SMT implementation. Yamada and Knight [5] published a translation model that takes an English parsing and outputs a translated Chinese sentence. Charniak reversed the process to improve the translation quality from Chinese to English. Instead of retrieving a Chinese sentence from an English parsing, all possible parsing that can yield the input Chinese sentence are retrieved. Charniak's SMT model then search among all the possible parsing for the highest probability. For example, a given Chinese sentence "你好嗎" can be the result from these parsing: "you good", "how are you", "how do you", etc., in Charniak's parser, P("how are you") should have the highest probability, making it the final output of the model. ExperimentA modified version of greedy SMT is implemented in this report, where in addition to considering the relationship between SL and TL, the adjacent words information and ordering issue are explored. In this experiment, English to Chinese mapping is retrieved from Taiwan Panorama organized by National Academy for Educational Research, and the usage of Chinese is obtained from [6]. After lemmatizing the input data, the core translation function can be abstracted as follows: 1 translate(S) 2 result := get_current_translation(S) 3 for i in S.length 4 //maximize the probability on both SLTL mapping and TL usage 5 result[i] := argmax_prob(result[i1], word, result[i+1], dictionary) 6 end for 7 for i in result.length 8 // swap if two TL words gives a higher probability when flipped 9 if(TL_prob(result[i] + result[i+1]) < TL_prob(result[i+1] + result[i]) 10 result[i], result[i+1] := result[i+1], result[i] 11 end if 12 end for 13 return result In the beginning, the model will check if there is an available translation to improve on, and when the SL sentence is not available, it simply translates the input by considering the SLTL relationship. With an initial translation to work on, the model does the following: $$argmax_{t \in T} P(t_is_i) + P(t_{i1}, t_i, t_{i+1})$$ The former term describes the probability of a SL word \(s_i\) being translated into the TL word \(t_i\), and the latter term defines how probable [\(t_{i1}\) (the TL word before \(t_i\)), \(t_i\), \(t_{i+1}\)(the TL word after \(t_i\))] is in TL usage. To address the ordering problem, the program examines if swapping items in bigrams increases the bigram probability in TL usage in the loop in line 7. To describe line 7 mathematically, the loop can be illustrated as: $$argmax_{\pi = [t_i, t_{i+1}] \in T} P(\pi)$$
The function Two additional experiments are tested to improve translation quality: translate to empty and translate based on TL usage. Translate to empty
If a SL word can mean an empty string, there will be a slight probability Translate based on TL usageWith a slight probability, TL candidates can be selected in the following manner: given the previous translated TL vocabulary \(t_{i1}\), look at the TL usage and select the most common word that follows \(t_{i1}\), and make it the translation of \(s_i\). The translation accuracy is measured by simply comparing the prediction with the ground truth divided by the translation sentence length. The accuracy is within the range of 32% to 0%, but most of the results can convey the meaning of the original SL sentence. ConclusionSMT provides an easier way to implement machine translation compared to RBMT. SMT selects the most suitable translation over the TL space by considering either SL statistics, TL statistics, SLTL mapping, or all of them. In the experiment section, it shows that it is more powerful than RBMT, where it can translate more sentences without manually adding all the underlying rules and the ability to produce a more grammatically correct order. Still, translation quality from SMT is not optimal, as seen in the experiment in this report, and more advanced technique should be used to capture the complex transition from one language to another. References
