This paper was recently published in the journal Literary and Linguistic Computing:


Faili, Heshaam, Nava Ehsan, Mortaza Montazery, and Mohammad Taher Pilehvar. "Vafa spell-checker for detecting spelling, grammatical, and real-word errors of Persian language." Literary and Linguistic Computing (2014).

With advancements in industry and information technology, large volumes of electronic documents such as newspapers, emails, weblogs, and theses are produced daily. Producing electronic documents has considerable benefits such as easy organizing and data management. Therefore, existence of automatic systems such as spell and grammar-checker/correctors can help to improve their quality. In this article, the development of an automatic spelling, grammatical and real-word error checker for Persian (Farsi) language, named Vafa Spell-Checker, is explained. Different kinds of errors in a text can be categorized into spelling, grammatical, and real-word errors. Vafa Spell-Checker is a hybrid system in which both rule-based and statistical approaches are used to detect/correct whole types of errors. The detection and correction phases of spelling and real-word errors are fully statistical, while for the grammar-checker, a rule-based approach is proposed. Vafa Spell-Checker attempts to process these kinds of error types in an integrated system for Persian language. The results on the real-world collected test set indicate that continuing the work on grammar-checker requires statistical approaches. Evaluation results with respect to F0.5 measure for spell-checker, grammar-checker, and real-word error checker are about 0.908, 0.452, and 0.187, respectively. Moreover, several free-usable language resources for Persian that are generated during this project are demonstrated in this article. These resources could be used in the further research in Persian language.
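
For readers unfamiliar with the F0.5 measure mentioned in the abstract: it is the F-beta score with beta = 0.5, which weights precision more heavily than recall. Below is a minimal sketch of the standard formula. The function name `f_beta` and the counts in the usage lines are hypothetical and are not taken from the paper.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=0.5):
    """Standard F-beta score from raw counts; beta < 1 favours precision."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts for illustration only (not results from the paper).
score = f_beta(true_positives=80, false_positives=20, false_negatives=40)
print(round(score, 3))
```

With beta = 0.5, a checker that flags few false errors is rewarded more than one that catches every error at the cost of many false alarms, which is usually the preferred trade-off for spelling and grammar tools.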



Tags: paper, Persian language processing, spell checking, grammar checking
+ Posted by Mohammad Sadegh Rasooli on Thursday, 3 Mehr 1393, at 20:06 |
This paper was recently published in the journal Literary and Linguistic Computing:

Saeedi, Parisa, Heshaam Faili, and Azadeh Shakery. "Semantic role induction in Persian: An unsupervised approach by using probabilistic models." Literary and Linguistic Computing (2014).


Semantic roles describe the relation between a predicate (typically a verb) and its arguments. Semantic role labeling is a Natural Language Processing task that extracts these relations in the sentences. Different applications such as machine translation and question answering benefit from this level of semantic analysis. The creation of semantic role-annotated data is an obstacle to develop supervised learning systems, so we present a novel unsupervised approach to semantic role induction task. In our approach, which is formulized as a clustering method, the argument instances of the verb are clustered into semantic role classes specified for that verb. We present a Bayesian model for learning argument structure from un-annotated text and estimate the model parameters using expectation maximization method. Clustering of argument instances of a verb, which have semantic and syntactic similarities, can be a promising approach for unsupervised learning of their semantic roles. The only linguistic knowledge, which is prepared for linking the argument instances to semantic clusters is extracted from a verb valance lexicon. Our evaluation results on Persian language show that our system in both small and large training datasets works better than a strong baseline proposed by (Lang and Lapata 2010) which its idea is developed in Persian. We have used purity and inverse purity measures to assess the quality of the proposed semantic role clustering method. The results indicate the improvement about 9.73 and 1.65% in small dataset and 2.85 and 0.67% in large dataset in purity and inverse purity, respectively.
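
The purity and inverse purity measures named in the abstract can be sketched as follows. This uses the common textbook definitions, which may differ in detail from the exact variant used in the paper. The function names and the toy role labels are hypothetical, and both lists are assumed to be non-empty and of equal length.

```python
from collections import Counter


def purity(clusters, gold):
    """Share of items that belong to the majority gold class of their cluster."""
    counts_per_cluster = {}
    for cluster_id, gold_label in zip(clusters, gold):
        counts_per_cluster.setdefault(cluster_id, Counter())[gold_label] += 1
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(gold)


def inverse_purity(clusters, gold):
    """Same computation with the roles of clusters and gold classes swapped."""
    return purity(gold, clusters)


# Toy example: six argument instances, induced cluster ids vs. gold roles.
clusters = [0, 0, 1, 1, 2, 2]
gold = ["agent", "agent", "agent", "agent", "patient", "patient"]
p = purity(clusters, gold)
ip = inverse_purity(clusters, gold)
print(p, ip)
```

In this toy case every cluster is pure, but the "agent" role is split across two clusters, so inverse purity is lower than purity. Reporting both guards against trivially high scores from over-splitting or over-merging.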

Tags: paper, Persian language processing, semantics
+ Posted by Mohammad Sadegh Rasooli on Thursday, 3 Mehr 1393, at 20:03 |

Singular Value Decomposition (SVD) is one of the most widely used mathematical methods based on matrix operations. It is mostly used to reduce the dimensionality of features in learning, or to map the space of a learning problem to a new space with fewer dimensions. This tutorial teaches the topic in a very simple way and walks through it step by step with a simple example.

Kirk Baker, "Singular Value Decomposition Tutorial".
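
As a rough illustration of dimensionality reduction with SVD, here is a minimal sketch in plain Python. It approximates the top singular triplets by power iteration with deflation, which is a swapped-in numerical technique and not the hand calculation used in the tutorial. All function names and the toy matrix are hypothetical. In practice a library routine such as `numpy.linalg.svd` would be the usual choice.

```python
import math
import random


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def top_singular_triplet(matrix, iterations=500, seed=0):
    """Approximate the largest singular value and its vectors (power iteration on A^T A)."""
    transposed = [list(column) for column in zip(*matrix)]
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in transposed]  # one entry per column
    length = norm(v)
    v = [x / length for x in v]
    for _ in range(iterations):
        w = mat_vec(transposed, mat_vec(matrix, v))
        length = norm(w)
        if length == 0.0:  # zero matrix: nothing left to extract
            break
        v = [x / length for x in w]
    av = mat_vec(matrix, v)
    sigma = norm(av)
    u = [x / sigma for x in av] if sigma > 0.0 else [0.0] * len(matrix)
    return u, sigma, v


def truncated_svd(matrix, k):
    """Return k (u, sigma, v) triplets, removing each found component in turn."""
    residual = [list(row) for row in matrix]
    triplets = []
    for _ in range(k):
        u, sigma, v = top_singular_triplet(residual)
        triplets.append((u, sigma, v))
        residual = [[residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
                    for i in range(len(u))]
    return triplets


def reduce_rows(matrix, triplets):
    """Project each row onto the right singular vectors: one coordinate per triplet."""
    return [[sum(a * b for a, b in zip(row, v)) for _, _, v in triplets]
            for row in matrix]


# Toy data: 4 items with 3 features; the matrix has rank 2 by construction.
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
triplets = truncated_svd(data, k=2)
reduced = reduce_rows(data, triplets)  # 4 items, now with 2 features each
print([round(sigma, 3) for _, sigma, _ in triplets])
```

Because the toy matrix has rank 2, the two retained components carry all of its information. On real data, the dropped components hold the smallest singular values, so the reduced representation loses the least possible energy for a given number of dimensions.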

Tags: educational resources, linear algebra, dimensionality reduction, machine learning, mathematics
+ Posted by Mohammad Sadegh Rasooli on Friday, 21 Shahrivar 1393, at 8:33 |
Springer recently published this book:

Download the book

Nugues, Pierre M. "Language Processing with Perl and Prolog", 2014.

This book teaches the principles of natural language processing, first covering practical linguistics issues such as encoding and annotation schemes, defining words, tokens and parts of speech, and morphology, as well as key concepts in machine learning, such as entropy, regression, and classification, which are used throughout the book. It then details the language-processing functions involved, including part-of-speech tagging using rules and stochastic techniques, using Prolog to write phrase-structure grammars, syntactic formalisms and parsing techniques, semantics, predicate logic, and lexical semantics, and analysis of discourse and applications in dialogue systems. A key feature of the book is the author's hands-on approach throughout, with sample code in Prolog and Perl, extensive exercises, and a detailed introduction to Prolog. The reader is supported with a companion website that contains teaching slides, programs, and additional material.



If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu


Tags: book, programming languages, Perl, Prolog
+ Posted by Mohammad Sadegh Rasooli on Friday, 24 Mordad 1393, at 8:20 |

The original edition of this book is available for purchase, but you can download a copy of it, as well as the book on text processing with LingPipe, from the link below:



Tags: book, processing tools
+ Posted by Mohammad Sadegh Rasooli on Sunday, 11 Khordad 1393, at 23:20 |
These papers were recently published at LREC and focus on Persian, either exclusively or as part of the work.

A Persian Treebank with Stanford Typed Dependencies
Extending the Coverage of a MWE Database for Persian CPs Exploiting Valency Alternations
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
The CMU METAL Farsi NLP Approach
HamleDT 2.0: Thirty Dependency Treebanks Stanfordized

Tags: paper, Persian language processing
+ Posted by Mohammad Sadegh Rasooli on Saturday, 3 Khordad 1393, at 22:53 |
This piece is slated to become a chapter of the Oxford Handbook of Inflection.

Katya Pertsova. "Machine learning of inflection," to appear in the Oxford Handbook of Inflection, M. Baerman (ed.).

Tags: morphological analysis, machine learning, book
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 16 Ordibehesht 1393, at 22:45 |
This paper is to be published at the ACL-2014 conference:

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014. [pdf] [code]

Tags: paper, processing tools
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 16 Ordibehesht 1393, at 22:41 |

This paper is to be published at the ACL-2014 conference. Its main contribution is generating new words of a language from raw data, without any other linguistic information. This research area is very new, and very few papers have worked on it.

Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash, and Owen Rambow. Unsupervised Morphology-Based Vocabulary Expansion. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, Maryland, USA, June 2014.


We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary with a small amount of unlabeled training data.
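
To make the evaluation idea in the abstract concrete, here is a minimal sketch of a token-based out-of-vocabulary (OOV) rate and its relative decrease after adding generated words. The function names and the toy word lists are hypothetical, and whether the paper computes its figures in exactly this way is an assumption.

```python
def oov_rate(test_tokens, vocabulary):
    """Token-based OOV rate: share of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    missing = sum(1 for token in test_tokens if token not in vocabulary)
    return missing / len(test_tokens)


def oov_reduction(test_tokens, base_vocabulary, generated_words):
    """Relative decrease in OOV rate once the generated words are added."""
    before = oov_rate(test_tokens, base_vocabulary)
    after = oov_rate(test_tokens, set(base_vocabulary) | set(generated_words))
    return 0.0 if before == 0.0 else (before - after) / before


# Toy data for illustration only (not from the paper).
base_vocabulary = {"walk", "walks", "talk"}
generated_words = {"talks", "walked", "talking"}
held_out = ["walk", "talks", "talked", "walks", "talks"]

rate_before = oov_rate(held_out, base_vocabulary)
reduction = oov_reduction(held_out, base_vocabulary, generated_words)
print(rate_before, round(reduction, 3))
```

Counting tokens and not types means that a frequently occurring unseen word contributes more to the score, which matches how recognition errors accumulate in applications such as speech recognition.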

Tags: paper, unsupervised learning, morphological analysis, word generation, state machine
+ Posted by Mohammad Sadegh Rasooli on Thursday, 4 Ordibehesht 1393, at 4:08 |

These are relatively informal pieces by respected professors in language processing that I think are worth a look:

* "Write the Paper First" by Jason Eisner (Johns Hopkins University)

* "Writing clear and concise sentences", by Sharon Goldwater (Edinburgh University)

* "Some advice on writing well for NLP", by Philip Resnik (University of Maryland College Park)

Tags: paper, paper writing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:49 |