Part-of-Speech and Morphological Tagging of Algerian Judeo-Arabic
2022, Northern European Journal of Language Technology
Abstract
Most linguistic studies of Judeo-Arabic, the ensemble of dialects spoken and written by Jews in Arab lands, are qualitative in nature and rely on laborious manual annotation work, and are therefore limited in scale. In this work, we develop automatic methods for morpho-syntactic tagging of Algerian Judeo-Arabic texts published by Algerian Jews in the 19th-20th centuries, based on a linguistically tagged corpus. First, we describe our semi-automatic approach for preprocessing these texts. Then, we experiment with an off-the-shelf morphological tagger, several specially designed neural network taggers, and a hybrid human-in-the-loop approach. Finally, we perform a real-world evaluation on new texts that had never been tagged before, comparing the taggers with human expert annotators. Our experimental results demonstrate that these methods can dramatically speed up and improve the linguistic research pipeline, enabling linguists to study these dialects on a much greater scale.
Key takeaways
AI
- Developed AJATag to automate morpho-syntactic tagging of Algerian Judeo-Arabic, enhancing linguistic research scalability.
- AJATag achieved over 91% overall accuracy, outperforming the benchmark MarMoT by 2% in real-world evaluations.
- The TAJA corpus comprises 9,904 sentences and 61,481 tokens, crucial for training the tagging models.
- Hierarchical models showed improved performance with out-of-vocabulary words, reaching 78.42% accuracy on analysis2.
- The study addresses the scarcity of AJA linguists, facilitating large-scale analysis of dialects through automation.
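The takeaways above report overall accuracy and accuracy on out-of-vocabulary (OOV) words separately. The following is a minimal sketch of how such a split could be computed, assuming an OOV word is any test token that never appears in the training data; the function name, the placeholder tokens, and the tags are hypothetical and are not taken from the paper or from the TAJA corpus.

```python
def accuracy_by_vocabulary(train_tokens, test_tokens, gold_tags, predicted_tags):
    """Return overall and OOV tagging accuracy.

    Assumption: a test token is out-of-vocabulary (OOV) when it does not
    occur anywhere in the training tokens.
    """
    if not (len(test_tokens) == len(gold_tags) == len(predicted_tags)):
        raise ValueError("test tokens, gold tags and predictions must align")

    vocabulary = set(train_tokens)
    total = correct = 0
    oov_total = oov_correct = 0

    for token, gold, predicted in zip(test_tokens, gold_tags, predicted_tags):
        hit = gold == predicted
        total += 1
        correct += hit
        if token not in vocabulary:
            oov_total += 1
            oov_correct += hit

    return {
        "overall": correct / total if total else None,
        # None signals that the test set contained no OOV tokens.
        "oov": oov_correct / oov_total if oov_total else None,
    }


# Hypothetical placeholder data, for illustration only.
train_tokens = ["w1", "w2", "w3"]
test_tokens = ["w1", "w9", "w2", "w8"]   # "w9" and "w8" are OOV
gold_tags = ["N", "V", "N", "P"]
predicted_tags = ["N", "V", "N", "N"]

scores = accuracy_by_vocabulary(train_tokens, test_tokens, gold_tags, predicted_tags)
print(scores)
```

Reporting the two numbers side by side shows how much of a tagger's accuracy depends on having seen a word during training.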
FAQs
AI
What were the results of the hierarchical char-CNN model for morphological tagging accuracy?
The hierarchical char-CNN model achieved accuracies of 89% for analysis1 and 92.7% for analysis2, outperforming MarMoT's 82.3% and 85.6%, respectively.
How did AJATag's accuracy compare on out-of-vocabulary (OOV) words?
AJATag's accuracy for morphologically tagging OOV words reached 74.91% for analysis1, significantly better than MarMoT's 55.82%.
What was the significant challenge addressed by this study regarding AJA linguistic analysis?
The study tackled the difficulty of scaling linguistic analysis due to the rich morphology of AJA and decreasing speaker expertise.
How did the authors evaluate the practicality of their NLP models in a real-world setting?
They conducted a user study with expert AJA linguists on unannotated texts from the NAJA corpus, scoring model accuracy against human annotations.
What method was used to assess inter-annotator agreement between expert linguists?
Cohen's Kappa was calculated, yielding a score of 0.875, indicating high agreement between annotations from a senior expert and a junior expert.
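For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain-Python illustration of that standard formula; the function name and the toy tag sequences are hypothetical and do not reproduce the study's annotations or its 0.875 score.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal label frequencies, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations, for illustration only.
senior = ["NOUN", "VERB", "NOUN", "PREP", "NOUN", "VERB"]
junior = ["NOUN", "VERB", "NOUN", "PREP", "VERB", "VERB"]

kappa = cohens_kappa(senior, junior)
print(round(kappa, 4))
```

In this toy case the annotators agree on 5 of 6 items, and chance agreement is 13/36, so kappa is 17/23, noticeably lower than the raw agreement rate.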
Ofra Tirosh-Becker