Natural Language Processing

These are relatively informal pieces by well-regarded professors in language processing that I think are worth a look:


* "Write the Paper First" by Jason Eisner (Johns Hopkins University)

* "Writing clear and concise sentences", by Sharon Goldwater (Edinburgh University)

* "Some advice on writing well for NLP", by Philip Resnik (University of Maryland College Park)


Tags: paper, paper writing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:49 |
This paper is to be published at the LREC 2014 conference.


Rudolf Rosa, Jan Mašek, David Mareček, Martin Popel, Daniel Zeman, and Zdeněk Žabokrtský. HamleDT 2.0: Thirty Dependency Treebanks Stanfordized. LREC 2014.


Abstract

We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both of the annotation styles, including adjustments that were necessary to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. 

We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future. 

We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.


Tags: paper, language resources, syntax
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:42 |
This paper is to be published at the LREC 2014 conference.

Seraji, M., Megyesi, B., and Nivre, J. 2014. A Persian Treebank with Stanford Typed Dependencies. In Proceedings of Language Resources and Evaluation, LREC 2014.


Abstract

We present the Uppsala Persian Dependency Treebank (UPDT) with a syntactic annotation scheme based on Stanford Typed Dependencies. The treebank consists of 6,000 sentences and 151,671 tokens with an average sentence length of 25 words. The data is from different genres, including newspaper articles and fiction, as well as technical descriptions and texts about culture and art, taken from the open source Uppsala Persian Corpus (UPC). The syntactic annotation scheme is extended for Persian to include all syntactic relations that could not be covered by the primary scheme developed for English. In addition, we present open source tools for automatic analysis of Persian containing a text normalizer, a sentence segmenter and tokenizer, a part-of-speech tagger, and a parser. The treebank and the parser have been developed simultaneously in a bootstrapping procedure. The result of a parsing experiment shows an overall labeled attachment score of 82.05% and an unlabeled attachment score of 85.29%. The treebank is freely available as an open source resource.


Tags: paper, language resources, syntax, Persian language processing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:36 |

This paper was recently published in the journal Language Resources and Evaluation.


Kemal Oflazer. Turkish and its challenges for language processing. LRE 2014.


Abstract

We present a short survey and exposition of some of the important aspects of Turkish that have proven challenging for natural language processing. Most of the challenges stem from the complex morphology of Turkish and how morphology interacts with syntax. We also provide a short overview of the major tools and resources developed for Turkish natural language processing over the last two decades.



Tags: paper, Turkish language processing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:31 |

This book was recently published by Morgan & Claypool.

Philipp Cimiano, Christina Unger and John McCrae. Ontology-Based Interpretation of Natural Language.


Abstract

For humans, understanding a natural language sentence or discourse is so effortless that we hardly ever think about it. For machines, however, the task of interpreting natural language, especially grasping meaning beyond the literal content, has proven extremely difficult and requires a large amount of background knowledge. This book focuses on the interpretation of natural language with respect to specific domain knowledge captured in ontologies. The main contribution is an approach that puts ontologies at the center of the interpretation process. This means that ontologies not only provide a formalization of domain knowledge necessary for interpretation but also support and guide the construction of meaning representations.


We start with an introduction to ontologies and demonstrate how linguistic information can be attached to them by means of the ontology lexicon model lemon. These lexica then serve as basis for the automatic generation of grammars, which we use to compositionally construct meaning representations that conform with the vocabulary of an underlying ontology. As a result, the level of representational granularity is not driven by language but by the semantic distinctions made in the underlying ontology and thus by distinctions that are relevant in the context of a particular domain. We highlight some of the challenges involved in the construction of ontology-based meaning representations, and show how ontologies can be exploited for ambiguity resolution and the interpretation of temporal expressions. Finally, we present a question answering system that combines all tools and techniques introduced throughout the book in a real-world application, and sketch how the presented approach can scale to larger, multi-domain scenarios in the context of the Semantic Web.


If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu



Tags: book, ontology
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 20 Farvardin 1393, at 8:26 |

The second edition of this book was recently published.


Claudia Leacock, Martin Chodorow, Michael Gamon and Joel Tetreault. Automated Grammatical Error Detection for Language Learners, Second Edition. Synthesis Lectures on Human Language Technologies, Morgan & Claypool Publishers, 2014.


Abstract

It has been estimated that over a billion people are using or learning English as a second or foreign language, and the numbers are growing not only for English but for other languages as well. These language learners provide a burgeoning market for tools that help identify and correct learners' writing errors. Unfortunately, the errors targeted by typical commercial proofreading tools do not include those aspects of a second language that are hardest to learn. This volume describes the types of constructions English language learners find most difficult: constructions containing prepositions, articles, and collocations. It provides an overview of the automated approaches that have been developed to identify and correct these and other classes of learner errors in a number of languages.

Error annotation and system evaluation are particularly important topics in grammatical error detection because there are no commonly accepted standards. Chapters in the book describe the options available to researchers, recommend best practices for reporting results, and present annotation and evaluation schemes.

The final chapters explore recent innovative work that opens new directions for research. It is the authors' hope that this volume will continue to contribute to the growing interest in grammatical error detection by encouraging researchers to take a closer look at the field and its many challenging problems.


If you do not have access to the book, contact me at rasooli{AT}cs.columbia{DOT}edu


Tags: book, grammatical error detection
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 21 Esfand 1392, at 19:34 |
This paper is to be presented at the EACL 2014 conference in Sweden.


Habibollah Asghari, Heshaam Faili, Jalal Maleki. A Probabilistic Approach to Persian Ezafe Recognition. 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, April 2014.


Abstract

In this paper, we investigate the problem of Ezafe recognition in Persian language. Ezafe is an unstressed vowel that is usually not written, but is intelligently recognized and pronounced by human. Ezafe marker can be placed into noun phrases, adjective phrases and some prepositional phrases linking head and modifiers. Persian Ezafe recognition is indeed a homograph disambiguation problem, which is a useful task for some language applications in Persian like TTS. In this paper, POS tags augmented by Ezafe tags (POSE) have been used to train a probabilistic model for Ezafe recognition. In order to build this model, a ten million word tagged corpus was used for training the system. For building the probabilistic model, three different approaches were used: Maximum Entropy POSE tagger, Conditional Random Fields POSE tagger and also a statistical machine translation approach based on parallel corpus. The results show that in comparison with previous works, the use of Conditional Random Fields POSE tagger can achieve outstanding results.


Tags: paper, Persian language processing
+ Posted by Mohammad Sadegh Rasooli on Wednesday, 21 Esfand 1392, at 7:00 |

This paper is to be presented at the EACL 2014 conference in Sweden. Using dependency parsing methods, it removes the disfluencies in sentences in addition to parsing them syntactically. It extends the method presented in the EMNLP 2013 paper.


Mohammad Sadegh Rasooli and Joel Tetreault. Non-Monotonic Parsing of Fluent Umm I mean Disfluent Sentences. 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, April 2014.

Abstract

Parsing disfluent sentences is a challenging task which involves detecting disfluencies as well as identifying the syntactic structure of the sentence. While there have been several studies recently into solely detecting disfluencies at a high performance level, there has been relatively little work into joint parsing and disfluency detection that has reached that state-of-the-art performance in disfluency detection. We improve upon recent work in this joint task through the use of novel features and learning cascades to produce a model which performs at 82.6 F-score.  It outperforms the previous best in disfluency detection on two different evaluations.


Tags: paper, parsing, speech processing, machine learning, disfluency detection
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 13 Esfand 1392, at 21:40 |
This tool was developed for open information extraction from Persian text.


Link

Documentation


Tags: Persian language processing, processing tools, information extraction
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 6 Esfand 1392, at 21:51 |
This open-source tool was developed for character normalization, spelling preprocessing, part-of-speech tagging, and dependency parsing of Persian.


Link


Web demo


Tags: Persian language processing, processing tools, parsing, tagging
+ Posted by Mohammad Sadegh Rasooli on Tuesday, 6 Esfand 1392, at 5:48 |