Equivalent Malay-Arabic Data Corpus Collection

This paper introduces a strategy for searching for and collecting comparable sentences from Arabic-Malay corpus data. The method is intended for students, researchers and amateur translators who want to search for and compare sentence structures in Arabic and Malay. The first stage is to collect corpus data on high-impact topics from the press, which must be able to enlarge the scope of study, as stated by Maia (2003). The second stage is to search using specified keywords based on the selected high-impact topics, such as the 2010 and 2014 Football World Cups. The search uses the Webcorp corpus engine (http://www.webcorp.org.uk/live/) and the open Google database (https://www.google.com). The third stage is to filter the data using the methods of Aker et al. (2012) and Braschler (1998), based on same story, related story and similar aspects. In the fourth stage, every category is measured by Guidere's (2002) equivalence strength: strong comparability (SC), medium comparability (MC) and weak comparability (WC). In the last stage, comparable sentences in the two languages are compiled in parallel according to Mona Baker's (1992) levels of grouping: word level, combination of words, grammatical, pragmatic and textual level. The data analysis, based on Mona Baker's and Vinay & Darbelnet's (1995) theories of equivalence, showed that a large number of sentences are on the same level of comparability in terms of information delivery. This can serve as additional evidence concerning the validity of the 'universal theory' in the science of translation.


INTRODUCTION
Efforts to develop an Arabic-Malay data corpus in Malaysia are still at an early stage. A strategy for searching for and collecting comparable data from open-source corpus data is proposed in order to save time and effort compared with the old method usually used by students, amateur translators and teachers. The strategy is also expected to support the development of dedicated software: an online sentence dictionary that displays comparable sentences in two or more languages on screen. This can be seen as an effort to assist and improve the use of dictionaries in schools and universities, which typically offer word translations with examples of usage only.
Other corpus projects include the Corpus of Contemporary Arabic (CCA) by Latifa Al-Sulaiti & Eric Atwell (2006) and the Arabic-Dutch corpus (Vertaalwoordenboek Arabisch-Nederlands) by Mark Van Mol. According to Mark (2002), the corpus he has developed since 1996 consists of two types: one based on words, containing 26,000 entries, and another based on sentences, containing 4,000,000 tagged entries taken from over 1,200,000 various sources of discourse. Developing such corpora has helped them to improve the tools and methods for teaching Arabic in their countries.
An Arabic-Malay corpus is much needed, especially for the development of the Arabic language in Malaysia. This study will contribute to the efforts to build and develop an Arabic data corpus that has a parallel translation in Malay. Applying the theory of lexicography and computer corpus analysis systematizes the compilation of dictionaries with the latest methods. In addition, it can support the creation of effective machine translation that helps translators speed up their work.
Translation studies between Arabic and Malay in Malaysia can be simplified and carried out effectively using corpus data, since the fundamentals of corpus usage and its results have been proven successful by European researchers. With this method, the latest data on language usage are stored. The saved corpus data can be accessed through a concordance system that allows researchers to see how the language is used and its estimated current meaning. As specified by St. John (2001), a concordance is a display of lines of words or combinations of words in context, extracted from the corpus text. Researchers must search by keyword for particular words so that the desired data can be generated.
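The concordance display described above can be illustrated with a minimal keyword-in-context sketch. It is not the Webcorp implementation; the function name `kwic`, the `width` parameter and the sample text are assumptions made only for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`."""
    lines = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, used only to show the display format.
sample = ("Spain won the final in extra time. The final goal came late. "
          "Fans waited for the final whistle.")
for line in kwic(sample, "final", width=20):
    print(line)
```

Each printed line shows the keyword in brackets with its left context right-aligned, so the hits line up in a column as in a typical concordance view.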

OBJECTIVE
This paper explains the steps that students and amateur translators need to take to find comparable Arabic-Malay data using an open-source corpus search engine. Every step is based on methods and theories introduced by researchers and experts in the field of translation corpora.

RESEARCH SIGNIFICANCE
The study is expected to introduce a strategy for searching for comparable sentences using an open-source corpus search engine. The strategy is intended as a foundation for developing an online Arabic-Malay sentence dictionary that compares texts in Arabic and Malay by the comparable method, known as bilingual comparable corpora. It is also expected to help test the validity of the 'universal' theory in translation, which still needs evidence from studies in language, corpus and translation. At the same time, it can be used as a teaching and learning methodology in the classroom and in home study for students and amateur translators acquiring practical knowledge of translation. The study was also designed to help school and university students solve problems in structuring sentences in Arabic, especially in writing and translation courses.

LITERATURE REVIEW
According to Zanettin (1998), Rusli & Norhafizah (2001) and Kruger (2004), two types of corpus can be used as objects of study in place of the dictionary. The first, referred to as parallel corpora, compares the original text with its translation. The second, known as the comparable bilingual corpus (comparable corpora), compares texts in two different languages that share the same topic, for example news from the world press reporting important events in multiple languages (Li Shao and Hwee Tou Ng, 2004).
According to Rusli & Nurhafizah (2001), the DBP attempted to build a database of Malay idiomatic and unidiomatic phrases based on actual language usage in translated texts. This database contains common phrases and regular expressions in the source language (English) with their equivalents in the target language (Malay). The phrases and expressions and their equivalents are derived from parallel and comparable corpora.
In Europe, comparable corpus studies began in the 1990s. A comparable corpus consists of bilingual texts that are not parallel but are interrelated and convey overlapping information, drawn from various websites, such as news issued by agencies like CNN and BBC. Studies that use comparable corpora include Munteanu and Marcu (2005) and Munteanu (2006).
Researchers have introduced various techniques. Rapp (1995) assumed that words that are translations of each other may appear in similar contexts, even in unrelated texts. Rapp took 100 words with their translations and represented the context of each as a co-occurrence vector. The result was that the co-occurrence matrices become more similar when the composition of the words in the matrices is similar in both languages. Aker et al. (2012), in collaboration with Google, established a simple technique for collecting comparable corpora from the web, because the techniques introduced by earlier researchers such as Rapp (1999), Munteanu and Marcu (2002), Resnik (1999), Huang et al. (2010), Talvensaari (2008) and others are time-consuming and require a lot of resources. The objective of their research was to reduce the time and resources needed. Previously, researchers had to go through three steps to collect and build a comparable corpus:
First: downloading the documents from the list of titles in the two languages. Downloading the documents takes a long time and faces many obstacles.
Second: matching against comparable data, and thus extracting the data. Headlines useful to the study, from various categories in the selected language, are taken together with the time and date of the newscast.
Third: dividing the title into several entities in the source language, named after people, places or organizations. The next phase is document alignment, which compares the titles of articles from the collected corpora. If a title is comparable, the actual article is then downloaded to obtain the matching corpus.
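The co-occurrence idea attributed to Rapp in this review can be sketched in a few lines. This is an illustration of the general principle only, not Rapp's procedure: the window size, the toy token lists and the small seed lexicon used to map Malay context words are all hypothetical.

```python
from collections import Counter
import math


def cooccurrence_vector(tokens, target, window=2):
    """Count the words found within `window` positions of `target`."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vec[tokens[j]] += 1
    return vec


def map_vector(vec, lexicon):
    """Project a context vector into the other language via a seed lexicon."""
    mapped = Counter()
    for word, count in vec.items():
        if word in lexicon:
            mapped[lexicon[word]] += count
    return mapped


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Toy data; the seed lexicon is a hypothetical three-entry dictionary.
malay = "sepanyol juara piala dunia".split()
english = "spain champion world cup".split()
seed = {"sepanyol": "spain", "piala": "cup", "dunia": "world"}

v_malay = map_vector(cooccurrence_vector(malay, "juara"), seed)
v_english = cooccurrence_vector(english, "champion")
score = cosine(v_malay, v_english)
```

A high `score` suggests that the two target words occur in similar contexts and are therefore candidate translations of each other; on real corpora the vectors would be built from many sentences rather than one.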

BAKER'S EQUIVALENCE LEVEL
Mona Baker (1992) introduces five levels of equivalence. The first is word level. This level exists in almost all languages of the world. One word represents one unit in the search for equivalent meaning.
The second is the combined-words level (above word level). Equivalence is achieved by combining words to give a meaning, as in collocation.
The third is the grammatical level, where each language has a different grammar. These differences pose problems in finding an equivalent meaning directly in translation. They also cause significant changes in how the message or information is transferred. These changes increase or decrease the information carried in a language.
The fourth is the pragmatic level, where equivalence concerns coherence and processes of interpretation such as speech acts. The text should be evaluated according to the intention of the author in the target culture so that readers in the target language can understand it.
The fifth is the textual level, where equivalence concerns thematic structure and data (information and messages) and the use of cohesive devices such as reference, connectors, substitution, ellipsis and lexical cohesion.

VINAY & DARBELNET EQUIVALENCE
The equivalence introduced by Vinay & Darbelnet (1995) differs from the dynamic equivalence introduced by Nida & Taber. According to Leonardi (2000), Vinay & Darbelnet concluded that equivalence is a procedure that reproduces the same situation as in the original text while using different wording. If this procedure is applied in translating, it can maintain the stylistic effect of the source-language text in the target-language text. The method is best suited to finding equivalents for proverbs, idioms, clichés, adjectival phrases and the onomatopoeia of animal sounds. Vinay & Darbelnet (in Pym, 2010) used a natural method of translation to maintain the same language function with different terminology. This is referred to as cultural adaptation to the target language. Vinay & Darbelnet prioritize stylistic effect, whereas Nida gave priority to the effect of the message on the target user. Although they both state that translation is an equivalence-oriented procedure that replicates the situation of the original text in different words (Leonardi, 2000), they argued that the semantic meanings given in dictionaries are not enough to produce a successful translation.
The theory is therefore seen as fit to be applied as a method in this study because it maintains the cultural and stylistic effect of the text. Students will thus be able to understand the true form of the language structure.

STRATEGY
The study will gather topics that are likely to share the same information in both Malay and Arabic. The search for comparable meaning will be assisted by open online corpora, using the Webcorp corpus search engine (http://www.webcorp.org.uk/live/) as well as the Google search engine (https://www.google.com).
Ratings are based on the heuristics of Aker et al. (2012): TS (title similarity), HS (time difference) and TLD (title length difference), used in the combination TS-HS-TLD, and on Braschler & Schäuble's (1998) categories of same story, related story and similar aspects. Each category is then measured for its level of comparability on the three levels recommended by Guidere (2002): strong equivalence, medium equivalence and weak equivalence.
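A rough sketch of how the TS, HS and TLD heuristics might be combined and mapped to the three comparability levels is given here. It is not the scoring used by Aker et al. (2012) or Guidere (2002): the unweighted mean, the 72-hour window, the thresholds 0.7 and 0.4, and the assumption that both titles have already been mapped to a shared vocabulary are all illustrative choices.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
import math


@dataclass
class Article:
    title_tokens: list   # title words, assumed already mapped to one shared vocabulary
    published: datetime  # publication time of the news item


def title_similarity(a, b):
    """TS: cosine similarity between the word counts of the two titles."""
    ca, cb = Counter(a.title_tokens), Counter(b.title_tokens)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def time_score(a, b, max_hours=72.0):
    """HS: 1.0 for simultaneous publication, falling to 0.0 at `max_hours` apart."""
    hours = abs((a.published - b.published).total_seconds()) / 3600.0
    return max(0.0, 1.0 - hours / max_hours)


def length_score(a, b):
    """TLD: 1.0 for titles of equal length, lower as the lengths diverge."""
    longest = max(len(a.title_tokens), len(b.title_tokens))
    if longest == 0:
        return 0.0
    return 1.0 - abs(len(a.title_tokens) - len(b.title_tokens)) / longest


def pair_score(a, b):
    """Unweighted mean of TS, HS and TLD (an illustrative combination)."""
    return (title_similarity(a, b) + time_score(a, b) + length_score(a, b)) / 3.0


def comparability_level(score, strong=0.7, medium=0.4):
    """Map a score to SC, MC or WC using assumed thresholds."""
    if score >= strong:
        return "SC"
    if score >= medium:
        return "MC"
    return "WC"


# Hypothetical article pair built around the 2010 final.
malay_item = Article(["spain", "beat", "netherlands", "final"], datetime(2010, 7, 12, 8))
arabic_item = Article(["spain", "beat", "netherlands"], datetime(2010, 7, 11, 23))
level = comparability_level(pair_score(malay_item, arabic_item))
```

In practice the thresholds would need to be tuned against manually judged article pairs before the SC, MC and WC labels could be relied on.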
Figure 1 shows a general overview of these steps:

Third: Filtering the data using the methods of Aker et al. (2012) and Braschler (1998), that is, same story, related story and similar aspects.
Fifth: Sorting comparable sentences between the two languages in parallel according to Mona Baker's (1992) five levels: word level, combination of words level, grammatical level, textual level and pragmatic level.

Figure 1 STEPS

The study will have difficulty finding equivalent texts if the searched topic does not dominate the news in general, because the equivalence score cannot be relied on when too little data is collected. On the other hand, topics that dominate the discussion in every major newspaper in the world open up a wider debate, as defined by Maia (2003), thus triggering new levels of language usage, and many terms related to the topic will appear. This study chooses football, from the sports genre, as its topic, particularly the World Cup in 2010 and 2014, because of its wider impact and its place in headlines around the world.

In particular, the search and compilation steps can be described as follows:
a) Identifying data: Source texts are taken from any sources that publish the article, document or text in both Arabic and English. The selected title is a topic that dominated world news at a particular time, whether in politics, economy, war or general genres. Titles are chosen through a general inspection of several titles, based on their importance and impact in the headlines of daily newspapers.

Final
Sepanyol juara, sotong gurita juga cipta sejarah
Andres Iniesta wira Sepanyol
Sepanyol sekat hajat Marwijk kubur "Total Football" buat selama-lamanya.

c) Filtering data: Among the preferred titles are the 2010 World Cup final match, matches between top teams and the 2014 World Cup final match.

b) Finding equivalent articles: This step comes after the topic within the football genre has been determined. Articles are searched for using the Google and Webcorp search engines. Based on the broad topic of the 2010 World Cup final match, the researcher can choose a few keywords such as 'World Cup 2010 tournament', 'final match', 'Spain champion', 'Spain beat the Netherlands' and 'winning goal'. At the same time, the date range is set around the match, from July 10, 2010 to July 13, 2010. The place is also set to 'Malaysia' to limit the results to the Malay language only. Among the titles returned by the search engine in Arabic are: