The third version of this multilingual resource has been released. It was built automatically and covers 271 languages.

http://babelnet.org/

 


Tags: language resources, semantic network
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 26 Azar 1393, at 20:54 |
This paper was recently published in the journal Computer Speech and Language.

 

Saeed Farzy and Heshaam Faili, A swarm-inspired re-ranker system for statistical machine translation, Computer Speech & Language, Volume 29, Issue 1, January 2015, Pages 45–62.

Abstract
Recently, re-ranking algorithms have been successfully applied on statistical machine translation systems. Due to the errors in the hypothesis alignment and varying word order between the source and target sentences and also the lack of sufficient resources such as parallel corpora, decoding may result in ungrammatical or non-fluent outputs. This paper proposes a re-ranking system based on swarm algorithms, which makes the use of sophisticated non-syntactical features to re-rank the n-best translation candidates. We introduce plenty of easy-computed non-syntactical features to deal with SMT system errors plus the quantum-behaved particle swarm optimization (QPSO) algorithm to adjust the weights of features. We have evaluated the proposed approach on 2 translation tasks in different language pairs (Persian → English and German → English) and genres (news and novel books). In comparison with PSO-, GA-, Perceptron- and averaged Perceptron-style re-ranking systems, the experimental study demonstrates the superiority of the proposed system in terms of translation quality on both translation tasks. In addition, the impacts of the proposed features on the translation quality have been analyzed, and the most positive ones have been recognized. At the end, the impact of the n-best list size on the proposed system is investigated.
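
To make the idea of re-ranking an n-best list concrete, here is a minimal sketch that scores each candidate with a weighted sum of its features and sorts by that score. The candidate sentences, the feature names (`lm`, `length_ratio`), and the weights are all invented for illustration; the paper's own features and its QPSO weight tuning are not reproduced here.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, Dict[str, float]]


def rerank(nbest: List[Candidate], weights: Dict[str, float]) -> List[str]:
    """Sort candidate translations by a weighted linear feature score."""

    def score(features: Dict[str, float]) -> float:
        # Features without a weight contribute nothing to the score.
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    ranked = sorted(nbest, key=lambda item: score(item[1]), reverse=True)
    return [text for text, _ in ranked]


# Hypothetical n-best list: (translation, feature values).
nbest = [
    ("he go to school", {"lm": -4.0, "length_ratio": 1.0}),
    ("he goes to school", {"lm": -2.5, "length_ratio": 1.0}),
    ("goes school he", {"lm": -6.0, "length_ratio": 0.75}),
]
# Hypothetical weights; in a real system these would be learned.
weights = {"lm": 1.0, "length_ratio": 0.5}

best_first = rerank(nbest, weights)
```

In a system like the one described, the interesting part is how the weights are found; this sketch only shows how they are applied once known.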

 


Tags: paper, Persian language processing, machine translation
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 27 Aban 1393, at 0:23 |

For several years now, neural networks, under the name deep learning, have attracted wide attention in language processing [link]. This is perhaps the most important thesis published in this area, and it contains a number of innovations.

Recursive Deep Learning for Natural Language Processing and Computer Vision, Richard Socher
PhD Thesis, Computer Science Department, Stanford University


Tags: thesis, neural networks, deep learning, machine learning
+ Posted by Mohammad Sadegh Rasooli on Thursday, 8 Aban 1393, at 1:23 |
The second edition of this book has recently been published:

Hang Li, Learning to Rank for Information Retrieval and Natural Language Processing, Second Edition, Synthesis Lectures on Human Language Technologies, October 2014, Morgan & Claypool Publishers.

Abstract
Learning to rank refers to machine learning techniques for training a model in a ranking task. Learning to rank is useful for many applications in information retrieval, natural language processing, and data mining. Intensive studies have been conducted on its problems recently, and significant progress has been made. This lecture gives an introduction to the area including the fundamental problems, major approaches, theories, applications, and future work. The author begins by showing that various ranking problems in information retrieval and natural language processing can be formalized as two basic ranking tasks, namely ranking creation (or simply ranking) and ranking aggregation. In ranking creation, given a request, one wants to generate a ranking list of offerings based on the features derived from the request and the offerings. In ranking aggregation, given a request, as well as a number of ranking lists of offerings, one wants to generate a new ranking list of the offerings. Ranking creation (or ranking) is the major problem in learning to rank. It is usually formalized as a supervised learning task. The author gives detailed explanations on learning for ranking creation and ranking aggregation, including training and testing, evaluation, feature creation, and major approaches. Many methods have been proposed for ranking creation. The methods can be categorized as the pointwise, pairwise, and listwise approaches according to the loss functions they employ. They can also be categorized according to the techniques they employ, such as the SVM based, Boosting based, and Neural Network based approaches. The author also introduces some popular learning to rank methods in details. These include: PRank, OC SVM, McRank, Ranking SVM, IR SVM, GBRank, RankNet, ListNet & ListMLE, AdaRank, SVM MAP, SoftRank, LambdaRank, LambdaMART, Borda Count, Markov Chain, and CRanking. 
The author explains several example applications of learning to rank including web search, collaborative filtering, definition search, keyphrase extraction, query dependent summarization, and re-ranking in machine translation. A formulation of learning for ranking creation is given in the statistical learning framework. Ongoing and future research directions for learning to rank are also discussed.
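
Of the methods listed in the abstract, Borda Count is the easiest to show in a few lines as an example of ranking aggregation. The sketch below assumes every input list ranks the same set of items and breaks ties alphabetically; both choices are mine for illustration and are not taken from the book.

```python
from collections import defaultdict
from typing import Dict, List


def borda_count(rankings: List[List[str]]) -> List[str]:
    """Aggregate several ranked lists into one using Borda count.

    In a list of n items, the item at position p (0-based) earns
    n - 1 - p points; points are summed over all lists.
    """
    scores: Dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    # Highest total first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda item: (-scores[item], item))


# Three hypothetical rankers ordering the same three documents.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "c", "b"],
]
aggregated = borda_count(rankings)
```

Here `a` collects 5 points, `b` 3, and `c` 1, so the aggregated order is `a`, `b`, `c`.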

 

If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu

 


Tags: book, ranking
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 6 Aban 1393, at 2:43 |
This paper was published at the workshop on Language Technology for Closely Related Languages at EMNLP'14:

 


Maryam Aminian, Mahmoud Ghoneim, and Mona Diab. Handling OOV Words in Dialectal Arabic to English Machine Translation, Language Technology for Closely Related Languages and Language Variants (LT4CloseLang), pages 99–108, Qatar, 2014.

Abstract
Dialects and standard forms of a language typically share a set of cognates that could bear the same meaning in both varieties or only be shared homographs but serve as faux amis. Moreover, there are words that are used exclusively in the dialect or the standard variety. Both phenomena, faux amis and exclusive vocabulary, are considered out of vocabulary (OOV) phenomena. In this paper, we present this problem of OOV in the context of machine translation. We present a new approach for dialect to English Statistical Machine Translation (SMT) enhancement based on normalizing dialectal language into standard form to provide equivalents to address both aspects of the OOV problem posited by dialectal language use. We specifically focus on Arabic to English SMT. We use two publicly available dialect identification tools: AIDA and MADAMIRA, to identify and replace dialectal Arabic OOV words with their modern standard Arabic (MSA) equivalents. The results of evaluation on two blind test sets show that using AIDA to identify and replace MSA equivalents enhances translation results by 0.4% absolute BLEU (1.6% relative BLEU) and using MADAMIRA achieves 0.3% absolute BLEU (1.2% relative BLEU) enhancement over the baseline. We show our replacement scheme reaches a noticeable enhancement in SMT performance for faux amis words.


Tags: paper, machine translation, Arabic language processing
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 6 Aban 1393, at 2:38 |
This paper was recently published in the journal Literary and Linguistic Computing:

 

Faili, Heshaam, Nava Ehsan, Mortaza Montazery, and Mohammad Taher Pilehvar. "Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language." Literary and Linguistic Computing (2014).

Abstract
With advancements in industry and information technology, large volumes of electronic documents such as newspapers, emails, weblogs, and theses are produced daily. Producing electronic documents has considerable benefits such as easy organizing and data management. Therefore, existence of automatic systems such as spell and grammar-checker/correctors can help to improve their quality. In this article, the development of an automatic spelling, grammatical and real-word error checker for Persian (Farsi) language, named Vafa Spell-Checker, is explained. Different kinds of errors in a text can be categorized into spelling, grammatical, and real-word errors. Vafa Spell-Checker is a hybrid system in which both rule-based and statistical approaches are used to detect/correct whole types of errors. The detection and correction phases of spelling and real-word errors are fully statistical, while for the grammar-checker, a rule-based approach is proposed. Vafa Spell-Checker attempts to process these kinds of error types in an integrated system for Persian language. The results on the real-world collected test set indicate that continuing the work on grammar-checker requires statistical approaches. Evaluation results with respect to F0.5 measure for spell-checker, grammar-checker, and real-word error checker are about 0.908, 0.452, and 0.187, respectively. Moreover, several free-usable language resources for Persian that are generated during this project are demonstrated in this article. These resources could be used in the further research in Persian language.
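
For readers unfamiliar with the F0.5 measure quoted in the abstract: it is the general F-beta score with beta = 0.5, which weights precision more heavily than recall. The sketch below uses the standard F-beta formula; the precision and recall values in the example are invented and are not the paper's numbers.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # conventional value when both are zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical checker with high precision but low recall.
p, r = 0.8, 0.4
f_half = f_beta(p, r)            # beta = 0.5 favours precision
f_one = f_beta(p, r, beta=1.0)   # ordinary F1 for comparison
```

With these values F0.5 is about 0.667 while F1 is about 0.533, which shows how the measure rewards a checker that rarely raises false alarms.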

 

 


Tags: paper, Persian language processing, spell checking, grammar checking
+ Posted by Mohammad Sadegh Rasooli on Thursday, 3 Mehr 1393, at 20:06 |
This paper was recently published in the journal Literary and Linguistic Computing:

Saeedi, Parisa, Heshaam Faili, and Azadeh Shakery. "Semantic role induction in Persian: An unsupervised approach by using probabilistic models." Literary and Linguistic Computing (2014).

Abstract

Semantic roles describe the relation between a predicate (typically a verb) and its arguments. Semantic role labeling is a Natural Language Processing task that extracts these relations in the sentences. Different applications such as machine translation and question answering benefit from this level of semantic analysis. The creation of semantic role-annotated data is an obstacle to develop supervised learning systems, so we present a novel unsupervised approach to semantic role induction task. In our approach, which is formulized as a clustering method, the argument instances of the verb are clustered into semantic role classes specified for that verb. We present a Bayesian model for learning argument structure from un-annotated text and estimate the model parameters using expectation maximization method. Clustering of argument instances of a verb, which have semantic and syntactic similarities, can be a promising approach for unsupervised learning of their semantic roles. The only linguistic knowledge, which is prepared for linking the argument instances to semantic clusters is extracted from a verb valance lexicon. Our evaluation results on Persian language show that our system in both small and large training datasets works better than a strong baseline proposed by (Lang and Lapata 2010) which its idea is developed in Persian. We have used purity and inverse purity measures to assess the quality of the proposed semantic role clustering method. The results indicate the improvement about 9.73 and 1.65% in small dataset and 2.85 and 0.67% in large dataset in purity and inverse purity, respectively.
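
The purity and inverse purity measures mentioned in the abstract can be sketched as follows, using their common textbook definitions: purity credits each induced cluster with its most frequent gold role, and inverse purity does the same with the roles of clusters and gold classes swapped. The paper may compute them with a slightly different variant (for example per verb), and the labels below are invented.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def purity(clusters: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Fraction of instances that fall in their cluster's majority gold class."""
    if len(clusters) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    # Count how often each (cluster, gold class) pair occurs.
    pair_counts = Counter(zip(clusters, gold))
    best: Dict[Hashable, int] = {}
    for (cluster, _), count in pair_counts.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(gold)


def inverse_purity(clusters: Sequence[Hashable], gold: Sequence[Hashable]) -> float:
    """Purity with the roles of clusters and gold classes exchanged."""
    return purity(gold, clusters)


# Hypothetical argument instances: induced cluster ids and gold role labels.
clusters = [0, 0, 1, 1, 2, 2]
gold = ["A0", "A0", "A0", "A0", "A1", "A1"]

pu = purity(clusters, gold)
ipu = inverse_purity(clusters, gold)
```

In this toy case every cluster is pure, so purity is 1.0, but the gold role `A0` is split across two clusters, which pulls inverse purity down to about 0.667. Reporting both measures guards against trivially over-splitting or over-merging clusters.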


Tags: paper, Persian language processing, semantics
+ Posted by Mohammad Sadegh Rasooli on Thursday, 3 Mehr 1393, at 20:03 |

Singular Value Decomposition (SVD) is one of the most widely used matrix-based mathematical methods. It is mostly used to reduce the dimensionality of features in learning, or to map the space of a learning problem into a new space with fewer dimensions. This tutorial teaches the topic in a very simple way and walks through it step by step with a simple example.

Kirk Baker, "Singular Value Decomposition Tutorial".
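
As a small companion to the tutorial, here is a minimal sketch of SVD-based dimensionality reduction using NumPy's `numpy.linalg.svd`. The data matrix is made up, and no mean-centering is applied (which a PCA-style analysis would normally add); the example is my own and is not taken from the tutorial.

```python
import numpy as np

# Hypothetical data matrix: 4 samples (rows) x 3 features (columns).
# It has rank 2 by construction, so 2 dimensions lose nothing.
X = np.array([
    [2.0, 0.0, 1.0],
    [4.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
    [0.0, 6.0, 0.0],
])

# Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of dimensions to keep

# Coordinates of each sample in the k-dimensional space.
X_reduced = U[:, :k] * s[:k]

# Mapping back to the original feature space gives the rank-k approximation.
X_approx = X_reduced @ Vt[:k, :]
```

`X_reduced` has shape (4, 2), and because the toy matrix is rank 2, `X_approx` matches `X` up to floating-point error. With real data the discarded singular values indicate how much information the reduction throws away.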


Tags: educational resources, linear algebra, dimensionality reduction, machine learning, mathematics
+ Posted by Mohammad Sadegh Rasooli on Friday, 21 Shahrivar 1393, at 8:33 |
Springer has recently published this book:

Download the book

Nugues, Pierre M. "Language Processing with Perl and Prolog", 2014.

This book teaches the principles of natural language processing, first covering practical linguistics issues such as encoding and annotation schemes, defining words, tokens and parts of speech, and morphology, as well as key concepts in machine learning, such as entropy, regression, and classification, which are used throughout the book. It then details the language-processing functions involved, including part-of-speech tagging using rules and stochastic techniques, using Prolog to write phrase-structure grammars, syntactic formalisms and parsing techniques, semantics, predicate logic, and lexical semantics, and analysis of discourse and applications in dialogue systems. A key feature of the book is the author's hands-on approach throughout, with sample code in Prolog and Perl, extensive exercises, and a detailed introduction to Prolog. The reader is supported with a companion website that contains teaching slides, programs, and additional material.

 

 

If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu

 


Tags: book, programming language, Perl, Prolog
+ Posted by Mohammad Sadegh Rasooli on Friday, 24 Mordad 1393, at 8:20 |

The original of this book is available for purchase, but a copy of it, as well as of the book Text Processing with LingPipe, can be downloaded from the link below:

Download

 


Tags: book, processing tools
+ Posted by Mohammad Sadegh Rasooli on Sunday, 11 Khordad 1393, at 23:20 |