European Journal of Business Science and Technology 2020, 6(2):154-169 | DOI: 10.11118/ejobsat.2020.010

Quality of Word Vectors and its Impact on Named Entity Recognition in Czech

František Dařena¹, Martin Süss¹
¹ Mendel University in Brno, Czech Republic

Named Entity Recognition (NER) focuses on finding named entities in text and classifying each of them into one of a set of entity types. Modern state-of-the-art NER approaches avoid hand-crafted features and rely instead on neural network systems that infer features from word embeddings. This paper analyzes how different aspects of word embeddings affect the process and results of NER in Czech, which has not been investigated so far. Various aspects of word vector preparation were examined experimentally, and suitable settings were determined for the individual steps: the training corpus, the number of word vector dimensions, the text preprocessing techniques, the context window size, the number of training epochs, and the algorithms used to infer the word vectors together with their specific parameters. The paper demonstrates that focusing on the process of word vector preparation can bring a significant improvement for NER in Czech, even without additional language-independent or language-dependent resources.
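As a rough illustration of the kind of experimental design the abstract describes, the aspects of word vector preparation can be written down as a search space whose combinations are enumerated one by one. The sketch below is only an assumption about how such a space might be organized: the parameter names mirror the aspects listed in the abstract, but the corpus labels, algorithm names, and all numeric values are hypothetical placeholders, not settings reported in the paper.

```python
from itertools import product

# Hypothetical search space. The keys mirror the aspects named in the
# abstract; every value is an illustrative placeholder, not a setting
# taken from the paper.
SEARCH_SPACE = {
    "corpus": ["corpus_a", "corpus_b"],
    "algorithm": ["algorithm_a", "algorithm_b", "algorithm_c"],
    "dimensions": [100, 300],
    "window": [2, 5],
    "epochs": [5, 10],
    "lowercase": [False, True],
}


def iter_configs(space):
    """Yield one dict per combination of the values in `space`."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


configs = list(iter_configs(SEARCH_SPACE))
print(len(configs), "configurations; first one:", configs[0])
```

Each resulting dictionary could then be handed to a word vector training routine and the vectors scored by the downstream NER result. An exhaustive grid grows multiplicatively with every added aspect, so in practice one aspect is often varied at a time while the others are held fixed.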

Keywords: Named Entity Recognition, word embeddings, word vectors training, natural language processing, Czech language
JEL classification: C63, C88

Received: December 4, 2020; Revised: December 21, 2020; Accepted: December 21, 2020; Published: December 29, 2020

Dařena, F., & Süss, M. (2020). Quality of Word Vectors and its Impact on Named Entity Recognition in Czech. European Journal of Business Science and Technology, 6(2), 154-169. doi: 10.11118/ejobsat.2020.010


This is an open access article distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0), which permits use, distribution, and reproduction in any medium, provided the original publication is properly cited. No use, distribution or reproduction is permitted which does not comply with these terms.