The original of this book is available for purchase, but you can get a copy of it, as well as of the book Text Processing with LingPipe, from the link below:

Download

 


Tags: book, processing tools
+ Posted by Mohammad Sadegh Rasooli on Sunday, 11 Khordad 1393, at 23:20 |
These papers were recently published at LREC; each deals with Persian, either as its sole focus or as part of a broader work.

A Persian Treebank with Stanford Typed Dependencies
Extending the Coverage of a MWE Database for Persian CPs Exploiting Valency Alternations
Converting an HPSG-based Treebank into its Parallel Dependency-based Treebank
The CMU METAL Farsi NLP Approach
HamleDT 2.0: Thirty Dependency Treebanks Stanfordized


Tags: paper, Persian language processing
+ Posted by Mohammad Sadegh Rasooli on Saturday, 3 Khordad 1393, at 22:53 |
This piece is slated to become a chapter of the Oxford Handbook of Inflection.


Katya Pertsova. "Machine learning of inflection," to appear in the Oxford Handbook of Inflection, M. Baerman (ed.).


Tags: morphological analysis, machine learning, book
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 16 Ordibehesht 1393, at 22:45 |
This paper is to be published at ACL 2014:

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), 2014. [pdf] [code]



Tags: paper, processing tools
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 16 Ordibehesht 1393, at 22:41 |

This paper is to be published at ACL 2014. Its main contribution is generating new words of a language from raw data alone, without any other linguistic information. This research area is very new, and very few papers have addressed it.


Mohammad Sadegh Rasooli, Thomas Lippincott, Nizar Habash, and Owen Rambow. Unsupervised Morphology-Based Vocabulary Expansion. The 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, Maryland, USA, June 2014.


Abstract

We present a novel way of generating unseen words, which is useful for certain applications such as automatic speech recognition or optical character recognition in low-resource languages. We test our vocabulary generator on seven low-resource languages by measuring the decrease in out-of-vocabulary word rate on a held-out test set. The languages we study have very different morphological properties; we show how our results differ depending on the morphological complexity of the language. In our best result (on Assamese), our approach can predict 29% of the token-based out-of-vocabulary with a small amount of unlabeled training data.
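The evaluation the abstract describes, the drop in out-of-vocabulary (OOV) rate on held-out text, can be illustrated with a minimal sketch. The function names and the toy data below are assumptions made for illustration; they are not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
def oov_rate(train_tokens, test_tokens, generated=()):
    """Token-based OOV rate: share of test tokens not in the vocabulary.

    The vocabulary is the set of training word types plus any generated
    word types (hypothetical output of a vocabulary generator).
    """
    vocab = set(train_tokens) | set(generated)
    if not test_tokens:
        return 0.0
    oov = sum(1 for tok in test_tokens if tok not in vocab)
    return oov / len(test_tokens)


def oov_reduction(train_tokens, test_tokens, generated):
    """Fraction of the baseline OOV tokens covered by the generated words."""
    base = oov_rate(train_tokens, test_tokens)
    if base == 0.0:
        return 0.0
    expanded = oov_rate(train_tokens, test_tokens, generated)
    return (base - expanded) / base


# Toy data, invented for this example.
train = ["a", "b", "c"]
test = ["a", "d", "d", "e"]
generated = ["d", "x"]

baseline = oov_rate(train, test)                 # 3 of 4 test tokens are unseen
reduction = oov_reduction(train, test, generated)  # "d" (2 tokens) is now covered
```

Under this reading, a figure such as the 29% reported for Assamese would correspond to `oov_reduction` on the held-out set.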



Tags: paper, unsupervised learning, morphological analysis, word generation, finite-state machine
+ Posted by Mohammad Sadegh Rasooli on Thursday, 4 Ordibehesht 1393, at 4:08 |

These are relatively informal pieces by well-regarded professors in language processing that I think are worth a look:


* "Write the Paper First" by Jason Eisner (Johns Hopkins University)

* "Writing clear and concise sentences", by Sharon Goldwater (Edinburgh University)

* "Some advice on writing well for NLP", by Philip Resnik (University of Maryland College Park)


Tags: paper, paper writing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:49 |
This paper is to be published at LREC 2014.


Rudolf Rosa, Jan Mašek, David Mareček, Martin Popel, Daniel Zeman, and Zdeněk Žabokrtský. HamleDT 2.0: Thirty Dependency Treebanks Stanfordized. LREC 2014.


Abstract

We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both of the annotation styles, including adjustments that were necessary to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. 

We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future. 

We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.


Tags: paper, language resources, syntax
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:42 |
This paper is to be published at LREC 2014.

Seraji, M., Megyesi, B., and Nivre, J. 2014. A Persian Treebank with Stanford Typed Dependencies. In Proceedings of Language Resources and Evaluation, LREC 2014.


Abstract

We present the Uppsala Persian Dependency Treebank (UPDT) with a syntactic annotation scheme based on Stanford Typed Dependencies. The treebank consists of 6,000 sentences and 151,671 tokens with an average sentence length of 25 words. The data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source Uppsala Persian Corpus (UPC). The syntactic annotation scheme is extended for Persian to include all syntactic relations that could not be covered by the primary scheme developed for English. In addition, we present open source tools for automatic analysis of Persian containing a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser. The treebank and the parser have been developed simultaneously in a bootstrapping procedure. The result of a parsing experiment shows an overall labeled attachment score of 82.05% and an unlabeled attachment score of 85.29%. The treebank is freely available as an open source resource.
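For readers unfamiliar with the two scores quoted in the abstract, here is a minimal sketch of how labeled and unlabeled attachment scores are commonly computed. The data layout (one `(head, label)` pair per token) and the toy sentence are assumptions for illustration; the paper's own evaluation setup, for instance whether punctuation is scored, is not stated here.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for parallel lists of (head_index, label) pairs.

    UAS counts tokens whose predicted head matches the gold head.
    LAS additionally requires the dependency label to match.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must have the same length")
    total = len(gold)
    if total == 0:
        return 0.0, 0.0
    uas_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    las_hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return uas_hits / total, las_hits / total


# Toy four-token sentence, invented for this example (head 0 is the root).
gold = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "nsubj"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
```

In the toy sentence the third token has the right head but the wrong label, so it counts toward UAS only; the fourth token has the wrong head and counts toward neither.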


Tags: paper, language resources, syntax, Persian language processing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:36 |

This paper was recently published in the journal Language Resources and Evaluation.


Kemal Oflazer. Turkish and its challenges for language processing. Language Resources and Evaluation, 2014.


Abstract

We present a short survey and exposition of some of the important aspects of Turkish that have proven challenging for natural language processing. Most of the challenges stem from the complex morphology of Turkish and how morphology interacts with syntax. We also provide a short overview of the major tools and resources developed for Turkish natural language processing over the last two decades.



Tags: paper, Turkish language processing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:31 |

This book was recently published by Morgan & Claypool.

Philipp Cimiano, Christina Unger, and John McCrae. Ontology-Based Interpretation of Natural Language.


Abstract

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge. This book focuses on the interpretation of natural language with respect to specific domain knowledge captured in ontologies. The main contribution is an approach that puts ontologies at the center of the interpretation process. This means that ontologies not only provide a formalization of domain knowledge necessary for interpretation but also support and guide the construction of meaning representations.


We start with an introduction to ontologies and demonstrate how linguistic information can be attached to them by means of the ontology lexicon model lemon. These lexica then serve as basis for the automatic generation of grammars, which we use to compositionally construct meaning representations that conform with the vocabulary of an underlying ontology. As a result, the level of representational granularity is not driven by language but by the semantic distinctions made in the underlying ontology and thus by distinctions that are relevant in the context of a particular domain. We highlight some of the challenges involved in the construction of ontology-based meaning representations, and show how ontologies can be exploited for ambiguity resolution and the interpretation of temporal expressions. Finally, we present a question answering system that combines all tools and techniques introduced throughout the book in a real-world application, and sketch how the presented approach can scale to larger, multi-domain scenarios in the context of the Semantic Web.


If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu



Tags: book, ontology
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:26 |