TS Corpus Project: An online Turkish Dictionary and TS DIY Corpus

  • Taner Sezer, Mersin University


TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets. Since 2011, the project has released 10 corpora, various NLP tools, a large dataset and an online dictionary. This paper focuses on the “online dictionary” and the “TS do-it-yourself corpus” released by the project. The dictionary data is based on the TDK (Turkish Language Society) Contemporary Dictionary, but the published dictionary offers many enhanced functions at the user-interface level. The main contribution of the study lies in the results presented to users in response to their queries: the collocations and tri-grams of the keyword searched for. The collocations are harvested from a large Turkish corpus of more than 760 million tokens, and the tri-grams are generated from Turkish Wikipedia pages. The do-it-yourself corpus (TS DIY Corpus) allows users to build their own corpora, modify or delete the uploaded texts, and run queries. Queries can be run in different modes, such as “as is”, “starting/ending with” or “including”; in addition, an advanced query option allows users to search by part-of-speech tags and lemmas. The results are given in KWIC (keyword in context) format. Various text classification options, such as publication date, author, domain and genre, can be selected during corpus creation. As the number of available Turkish corpora is limited, TS DIY Corpus is a candidate to become a useful, well-known and widely used tool for scholars and researchers who want to use a Turkish corpus or study Turkish texts of their own.
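The query modes, KWIC output and tri-gram generation described above can be illustrated with a minimal Python sketch. It is not the TS Corpus implementation: the function names (`tokenize`, `matches`, `kwic`, `trigram_counts`), the mode labels, the sample sentence and the naive regex tokenizer are all assumptions made for illustration, and part-of-speech or lemma queries are left out because they would require a Turkish tagger.

```python
import re
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): runs of Unicode word characters.
    # No case folding is done, since Turkish I/ı and İ/i need
    # locale-aware handling that plain str.lower() does not provide.
    return re.findall(r"\w+", text)


def matches(token, query, mode):
    # Hypothetical labels for the query modes named in the abstract.
    if mode == "as_is":
        return token == query
    if mode == "starts_with":
        return token.startswith(query)
    if mode == "ends_with":
        return token.endswith(query)
    if mode == "including":
        return query in token
    raise ValueError(f"unknown mode: {mode}")


def kwic(tokens, query, mode="as_is", window=2):
    # Return (left context, keyword, right context) for every hit.
    hits = []
    for i, tok in enumerate(tokens):
        if matches(tok, query, mode):
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


def trigram_counts(tokens):
    # Count every run of three consecutive tokens.
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# Made-up sample sentence, used only to exercise the functions.
sample = "kitap okumak güzel bir alışkanlıktır ve kitaplar iyi bir dosttur"
tokens = tokenize(sample)

for left, keyword, right in kwic(tokens, "kitap", mode="starts_with"):
    print(f"{left:>20} [{keyword}] {right}")
```

A production system would index the corpus rather than scan it linearly; the linear scan here only keeps the sketch short.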
Oct 6, 2017
