Open Source Corpus as a Tool for Translation Training

Building sentences in Arabic is difficult for amateur translators, and the same is true of Malay students who are learning to construct sentences in writing. Dictionaries alone are not enough to convey a translation, especially when phrases and sentences are translated from Malay into Arabic. Students struggle to build Arabic sentences because they lack exposure to Arabic sentence structure. This weakness has been observed in writing exercises at most schools and universities (Rosni, 2012; Ab. Halim Mohamad, 2009; Che Radiah, 2009). In general, a dictionary is well suited to finding the meaning of words but not the meaning of sentences. This paper proposes a method of comparing texts in the two languages through comparable corpora, which can also serve as a tool for translators. In addition to using the dictionary, students are guided to understand the structure of original Arabic sentences through the comparative method and then apply it in writing exercises. In this process, teachers, students and amateur translators use the computer as a tool and open-access corpus data on the web as the material. Translated texts or guide texts for writing exercises are selected following the method of Aker and colleagues (2012). Texts are filtered using the Webcorp open corpus engine http://www.webcorp.org.uk/live/ and the Google open database https://www.google.com. With this method, similarities between the first and the second language can be searched for and exploited. Any text identified as most closely comparable will be used in the classroom. It helps students and translators to build sentences in Arabic by comparing and evaluating them against the original text in the corpus. At the same time, students indirectly learn to understand and recognize the structure of original Arabic sentences.
It is hoped that this method will help amateur translators and students improve the quality of their translation and writing in Arabic.


Introduction
This method was introduced following the widespread use of bilingual corpora. It is not a method of translation but a method for finding comparable texts in two languages. The aim is to find equivalences of meaning that can serve as model translations close to the original level of language use. Mona Baker's theory of translation is used as the measure for determining comparability between two texts. The method is intended for classroom use in translation courses and as the plan and structure for a translation software tool supplied with Malay and Arabic corpus data. The software will display examples of comparable Malay-Arabic sentences as a guide for students studying correct Arabic sentence structure.

Background
One of the problems in schools and universities that needs a solution is students' weakness in mastering the Arabic language, especially in writing and translation skills. Students of Arabic have experienced this problem since their school years, and the weakness is carried over to university level. The problem is most apparent in the construction of phrases and sentences, which are the main means of effective communication, as found by studies such as Rosni (2012), Ab. Halim Mohamad (2009), Che Radiah (2009), Noor Anida binti Awang, Norhayati Binti Che Hat and Nurazan binti Mohmad Rouyan (2014), and Ghazali Yusri and Ahmad Bin Salleh (2006).
According to these studies, one factor behind this weakness is that students are influenced by the structure of their mother tongue. Other reasons are poor mastery of Arabic vocabulary, negligence and low motivation to learn. The weakness shows in significant mistakes in their writing and in their translations from Malay into Arabic. A dictionary is not enough to convey the meaning of a translation, because it translates only words and phrases, and the examples it gives are limited. Even when students can find the meaning of each word, they still have trouble structuring sentences in Arabic. The method is proposed as a basis for developing software that displays example sentences in both languages. The texts selected are appropriate to the needs of the students' essays. Students only need to search for sentences on the topic of their essay by keyword, root word or phrase. Given the limits of the dictionary, this process is intended to help students get to know and understand Arabic sentence structure more easily by studying the examples and comparing them with the sentences they construct.

Objectives
This paper aims to introduce a method of comparing text meaning through comparable corpora of two languages, Malay and Arabic. The results are expected to serve as a tool for teaching and learning in translation classes and as the basis for constructing a Malay-Arabic dictionary of sentences.

Research Significance
This method can be used to search for comparable texts and sentences between the first language and the second. Each text identified as closely comparable will be used as a teaching and learning tool in the classroom. It helps students and translators to structure new sentences in Arabic by comparing and evaluating them against the original texts in the corpus. At the same time, students indirectly learn to understand and recognize the structure of original Arabic sentences. The method is expected to help amateur translators and students improve their translation and writing in Arabic. Among its advantages, it gives access to more data than manual methods and is expected to form the basis for data software containing a collection of selected Malay and Arabic texts set side by side.

Literature Review
According to Zanettin (1998), Rusli and Norhafizah (2001), and Kruger (2004), two types of corpus can be used as study tools in place of dictionaries. The first is the parallel corpus (parallel corpora), which pairs an original text with its translation. The second is the comparable bilingual corpus (comparable corpora), which brings together texts in two different languages that share the same topic, for example newspaper headlines from around the world reporting an important event in multiple languages (Li Shao and Hwee Tou Ng, 2004).
According to Rusli Abdul Ghani and Nurhafizah Mohamed Husin (2001), Dewan Bahasa dan Pustaka (DBP) has worked to build a database of Malay phrases, idiomatic or otherwise, based on the actual use of those phrases in its translated texts. The database includes common phrases and regularly used expressions in the source language (English) with their equivalents in the target language (Malay). The phrases and expressions, together with their matches, are all derived from parallel and comparable corpora.
In Europe, comparable corpus studies began in the 1990s, and many corpus studies have been carried out since. A comparable corpus, as described above, consists of bilingual texts that are not parallel but are related and carry a great deal of overlapping information, such as web news released in various languages by agencies such as CNN and BBC. Studies that use comparable corpora include Munteanu and Marcu (2005) and Munteanu (2006).
Researchers have introduced various techniques. Rapp (1995) assumed that words that are translations of each other appear in similar contexts, even in unrelated texts. Rapp took 100 words and their translations and represented the context of each word as a co-occurrence vector. The result was that the co-occurrence matrices of the two languages become more alike when the words are arranged in the same order in both languages. The techniques of Talvensaari (2008) and others were time-consuming and required substantial resources, and the objective of Aker et al. (2012) was to reduce the time and resources needed. Previously, researchers had to go through three steps to gather and build a comparable corpus. First, the documents were downloaded from lists of titles in the two languages; this took a long time and met many obstacles.
Second, the comparable data were matched, and third, they were extracted. With the proposed technique, the first and second steps become easy. The study used English, Greek and German corpora. Its methodology is to search for news articles through the web and RSS feeds without having to download entire documents. Headlines useful to the study are taken from various categories in the selected languages, and the time and date of the news item, the article URL and the Google News cluster URL are recorded at the same time. From the topic search and the cluster URL, a total of 30 articles with headlines are collected and downloaded, forming a monolingual Google News search. This process is carried out within a specified period, i.e. one week, so that only the latest news is taken.
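As a minimal sketch of the feed-based collection described above, the following Python fragment reads the headline, publication time and URL of each item in an RSS 2.0 feed without downloading the articles themselves, and keeps only the items published within one week of a reference date. The sample feed, the Headline record, the function names and the example.org URLs are illustrative assumptions; this is not the implementation used by Aker et al. (2012).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime
from typing import List
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would fetch this XML from a news site.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <title>Example sports feed</title>
  <item>
    <title>Germany win the World Cup final</title>
    <link>http://news.example.org/a1</link>
    <pubDate>Mon, 14 Jul 2014 08:00:00 GMT</pubDate>
  </item>
  <item>
    <title>Transfer rumours round-up</title>
    <link>http://news.example.org/a2</link>
    <pubDate>Tue, 01 Jul 2014 09:30:00 GMT</pubDate>
  </item>
</channel></rss>"""


@dataclass
class Headline:
    """Metadata recorded for one news item (assumed record layout)."""
    title: str
    url: str
    published: datetime


def read_headlines(rss_xml: str) -> List[Headline]:
    """Extract title, link and publication time from every <item>."""
    root = ET.fromstring(rss_xml)
    headlines = []
    for item in root.iter("item"):
        headlines.append(Headline(
            title=item.findtext("title", default="").strip(),
            url=item.findtext("link", default="").strip(),
            # RSS 2.0 dates follow the RFC 822 format.
            published=parsedate_to_datetime(item.findtext("pubDate")),
        ))
    return headlines


def within_window(headlines: List[Headline], reference: datetime,
                  days: int = 7) -> List[Headline]:
    """Keep items published at most `days` days before `reference`."""
    window = timedelta(days=days)
    return [h for h in headlines
            if timedelta(0) <= reference - h.published <= window]


reference = datetime(2014, 7, 15, tzinfo=timezone.utc)
recent = within_window(read_headlines(SAMPLE_RSS), reference)
for h in recent:
    print(h.published.isoformat(), h.title, h.url)
```

Only the metadata is stored at this stage; under this sketch the article body would be fetched later, and only for headlines that turn out to have a comparable counterpart in the other language.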
Third, each title is divided into named entities in the source language, that is, names of people, places or organizations, and translated into the target language with Google Translate. The next phase aligns the documents by comparing the titles of the articles in the collected corpora. If the titles are comparable, the actual articles are downloaded to obtain the equivalent corpus.
According to Aker et al. (2012), to measure the equivalence of the corpus, pairs of titles were tested with various heuristic techniques. The best heuristics were TS (title similarity), HS (time difference) and TLD (title-length difference), used in the combination TS-HS-TLD. The results were assessed by Kendall's rank-order correlation and by human judgment based on Braschler's (1998) comparison scheme, i.e. five categories: same story, related story, same aspect, similar terms or unrelated. The hypothesis applies when the two articles contain the same story. Some findings from this comparison showed that parallel and comparable corpora can be used to build a database of phrases. However, the small size of the corpus yielded only a few findings. From the parallel corpus, some examples of phrases and expressions were quoted, while the comparable corpus yielded only terms and no idiomatic phrases. When comparable phrases from different texts (and different translators) agree, this indicates a consensus that the match is the most suitable one. Where a single source phrase has multiple matches, researchers can make their own choice based on suitability.
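To make the three title heuristics concrete, the sketch below computes TS, HS and TLD for a pair of headlines and combines them into a single score. The exact definitions used by Aker et al. (2012) are not given here, so the ones in the code are assumptions: TS is taken as the cosine similarity of bag-of-words title vectors, HS as the gap between publication times in hours, and TLD as the difference in token counts. The combination rule and the thresholds max_hours and max_len_diff are likewise hypothetical. Both titles are assumed to be in the same language already, for instance after the Arabic title has been machine-translated.

```python
from collections import Counter
from datetime import datetime
from math import sqrt

_PUNCT = ".,:;!?\"'()"


def tokens(title):
    """Lower-case whitespace tokens with surrounding punctuation removed."""
    stripped = (word.strip(_PUNCT).lower() for word in title.split())
    return [word for word in stripped if word]


def title_similarity(a, b):
    """TS (assumed definition): cosine similarity of bag-of-words vectors."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def hour_difference(t1, t2):
    """HS (assumed definition): gap between publication times in hours."""
    return abs((t1 - t2).total_seconds()) / 3600.0


def title_length_difference(a, b):
    """TLD (assumed definition): absolute difference in token counts."""
    return abs(len(tokens(a)) - len(tokens(b)))


def ts_hs_tld(a, time_a, b, time_b, max_hours=24.0, max_len_diff=5):
    """Hypothetical combination: TS is the score, but a pair is rejected
    (score 0.0) when HS or TLD exceeds its threshold."""
    if hour_difference(time_a, time_b) > max_hours:
        return 0.0
    if title_length_difference(a, b) > max_len_diff:
        return 0.0
    return title_similarity(a, b)


title_a = "Germany beat Argentina to win World Cup"
title_b = "Germany win World Cup after beating Argentina"
time_a = datetime(2014, 7, 13, 22, 0)
time_b = datetime(2014, 7, 14, 6, 0)

score = ts_hs_tld(title_a, time_a, title_b, time_b)
print(round(score, 3))
```

Pairs could then be ranked by this score, with the highest-ranked pairs passed on for human judgment. A weighted sum of the three measures would be an equally plausible way to combine them.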

Methodology
This study will collect comparable data in Malay and Arabic. Data are searched for and collected from open online corpora using the Webcorp corpus search engine http://www.webcorp.org.uk/live/ and the Google database https://www.google.com. The scope of the study is general material appropriate to writing skills courses, from level two of primary school, through all levels of secondary school, to Arabic writing skills courses at local universities.
The selected topics are those that dominate every major world newspaper and open up wider debate, as explained by Maia (2003), thus triggering new stages of language use and giving rise to many terms related to the topic.
The data samples for this method are taken from the sports genre, under the topic of the World Cup Championship. This topic was chosen because of its prominence in the headlines and on the front and back pages of newspapers, so the probability of obtaining comparable texts is greater. Important matches, especially the final, always receive wide coverage because the event is among the most widely followed in world sport.
Related data are evaluated with the heuristics of Aker et al. (2012), namely TS (title similarity), HS (time difference) and TLD (title-length difference) used in the combination TS-HS-TLD, and with the categories of Braschler and Schäuble (1998), namely same story, related story and same aspect. Each category is then rated for strength of comparability on three levels, as recommended by Guidere (2002): strong, medium and weak.
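One way to record these judgments is sketched below, assuming that each Malay-Arabic text pair has already been given a category and a comparability level by a human judge. The class names, the document identifiers, and the rule of keeping only pairs rated medium or above are illustrative assumptions and are not prescribed by the sources cited.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Category(Enum):
    """The five categories used for human judgment of article pairs."""
    SAME_STORY = "same story"
    RELATED_STORY = "related story"
    SAME_ASPECT = "same aspect"
    SIMILAR_TERMS = "similar terms"
    UNRELATED = "unrelated"


# The three categories retained in this study.
ACCEPTED = {Category.SAME_STORY, Category.RELATED_STORY, Category.SAME_ASPECT}


class Level(IntEnum):
    """Three levels of comparability, ordered from weakest to strongest."""
    WEAK = 1
    MEDIUM = 2
    STRONG = 3


@dataclass
class TextPair:
    malay_doc: str    # hypothetical document identifier
    arabic_doc: str   # hypothetical document identifier
    category: Category
    level: Level


def select_for_classroom(pairs, minimum=Level.MEDIUM):
    """Keep accepted pairs at or above `minimum`, strongest first.
    The threshold is an assumption made for this sketch."""
    kept = [p for p in pairs
            if p.category in ACCEPTED and p.level >= minimum]
    return sorted(kept, key=lambda p: p.level, reverse=True)


pairs = [
    TextPair("ms_001", "ar_001", Category.SAME_STORY, Level.STRONG),
    TextPair("ms_002", "ar_002", Category.RELATED_STORY, Level.WEAK),
    TextPair("ms_003", "ar_003", Category.UNRELATED, Level.STRONG),
    TextPair("ms_004", "ar_004", Category.SAME_ASPECT, Level.MEDIUM),
]

selected = select_for_classroom(pairs)
for p in selected:
    print(p.malay_doc, p.arabic_doc, p.category.value, p.level.name)
```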
Figure 1 shows an overview of the whole methodology of the study. Aker et al. (2012), in collaboration with Google, have shown a simple technique for collecting comparable corpora from the web. The technique was developed because those introduced by earlier researchers, such as Rapp (1999), Munteanu and Marcu (2002), Resnik (1999) and Huang et al. (2010), were time-consuming and required substantial resources.

Figure 1. Overview of the research methodology