Publications

You can find the full list of my articles on my Google Scholar profile.

naab: A ready-to-use plug-and-play corpus for Farsi

Published in arXiv preprint arXiv:2208.13486, 2022

Huge corpora of textual data are known to be a crucial need for training deep models such as transformer-based ones. This need is more pressing in lower-resource languages like Farsi. We propose naab, the biggest cleaned and ready-to-use open-source textual corpus in Farsi. It contains about 130GB of data, 250 million paragraphs, and 15 billion words. The project name is derived from the Farsi word ناب, which means pure and high grade. We also provide the raw version of the corpus, called naab-raw, and an easy-to-use preprocessor that can be employed by those who want to make a customized corpus.

Recommended citation: Sadra Sabouri, Elnaz Rahmati, Soroush Gooran, and Hossein Sameti. naab: A ready-to-use plug-and-play corpus for Farsi. arXiv preprint arXiv:2208.13486, 2022. https://arxiv.org/pdf/2208.13486

Docalog: Multi-document Dialogue System using Transformer-based Span Retrieval

Published in Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, 2022

Information-seeking dialogue systems, including knowledge identification and response generation, aim to respond to users with fluent, coherent, and informative answers based on users’ needs. This paper discusses our proposed approach, Docalog, for the DialDoc-22 (MultiDoc2Dial) shared task. Docalog identifies the most relevant knowledge in the associated document, in a multi-document setting. Docalog is a three-stage pipeline consisting of (1) a document retriever model (DR. TEIT), (2) an answer span prediction model, and (3) an ultimate span picker that decides on the most likely answer span out of all predicted spans. In the test phase of MultiDoc2Dial 2022, Docalog achieved F1 scores of 36.07% and 28.44% and SacreBLEU scores of 23.70% and 20.52% on the MDD-SEEN and MDD-UNSEEN folds, respectively.

Recommended citation: Sayed Hesam Alavian, Ali Satvaty, Sadra Sabouri, Ehsaneddin Asgari, and Hossein Sameti. 2022. Docalog: Multi-document Dialogue System using Transformer-based Span Retrieval. In Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering, pages 142–147, Dublin, Ireland. Association for Computational Linguistics. https://aclanthology.org/2022.dialdoc-1.16/