Corpus Linguistic Tools for Historical Semantics in Arabic


Abstract

In this paper, we present a set of corpus linguistic tools for conducting historical semantic research on the Arabic language. We compiled a Historical Arabic Corpus (HAC) that spans more than 1500 years of continuous language use. Built with techniques from the field of Natural Language Processing (NLP), the tools presented here have been used to create the HAC and to explore lexical semantic change. They are intended as a catalyst for the ambitious goal of compiling an Arabic dictionary on historical principles. The HAC and the tools can also be used for research in a variety of other areas of linguistics.

Authors

Omaima Ismail, Sane Yagi, Bassam Hammo
