QUALITY OF WORD VECTORS AND ITS IMPACT ON NAMED ENTITY RECOGNITION IN CZECH

Named Entity Recognition (NER) focuses on finding named entities in text and classifying them into one of the entity types. Modern state-of-the-art NER approaches avoid using handcrafted features and rely on feature-inferring neural network systems based on word embeddings. The paper analyzes the impact of different aspects related to word embeddings on the process and results of the named entity recognition task in Czech, which has not been investigated so far. Various aspects of word vector preparation were experimentally examined to draw useful conclusions. Suitable settings were determined for the individual steps, including the training corpus, the number of word vector dimensions, the text preprocessing techniques, the context window size, the number of training epochs, and the word vector inference algorithms and their specific parameters. The paper demonstrates that focusing on the process of word vector preparation can bring a significant improvement for NER in Czech, even without using additional language independent and dependent resources.


INTRODUCTION
Named Entity Recognition (NER) is one of the important subtasks of Information Extraction. It focuses on finding named entities in text and classifying them into one of the entity types. The types typically include persons, locations, organizations, temporal expressions, phone numbers, but sometimes also product names, brands, diagnoses, drug types, or publishers (Goyal et al., 2018; Nadeau and Sekine, 2007).
Named entities can be extracted using several approaches. The knowledge-based, also known as rule-based, approach relies on the availability of various lexicons and domain-specific knowledge (Yadav and Bethard, 2018). Knowledge-based systems are usually easy to implement, but it is difficult to define all the necessary rules. The systems usually have high precision but, on the other hand, lower recall, and they fail on unknown cases.
Machine learning approaches strive to eliminate the problems with hand-crafted rules. The NER problem is solved with a model automatically created by a computer. Systems using supervised learning require annotated corpora (a text with marked entities) and a learning algorithm that can automatically extract the rules for detecting entities. Systems based on unsupervised learning (where no labeled data is available) require only some syntactic patterns to identify candidates for entities that can be further evaluated and disambiguated (Etzioni et al., 2005; Nadeau et al., 2006).
The crucial aspect of learning a NER model is the selection of appropriate features. Modern state-of-the-art NER approaches avoid using hand-crafted features and rely on feature-inferring neural network systems based on word embeddings. These systems often outperform the systems using engineered features, even when those have access to domain-specific rules or lexicons (Yadav and Bethard, 2018).
A lot of research concentrates on massively used languages like English, German, or Spanish, and many approaches to named entity recognition have been developed for them. For the Czech language, the situation is quite different, as current research lags behind. There exist only a few named entity recognizers, and not much attention has been devoted to the optimization of all steps of the NER procedure. A typical example is the process of preparing word vectors to be used in the NER task. The goal of the paper is thus to analyze the impact of different aspects related to word embeddings on the process and results of the named entity recognition task in Czech. The goal is not to achieve the best results and beat the current state-of-the-art approaches, which usually requires using other language-dependent resources, but to discover how different algorithms, their parameters, or the size and quality of the corpora used for training can influence the result.

CURRENT STATE
The methods of NER often employ statistics (e.g., Conditional Random Fields - see Tkachenko and Simanovsky, 2012; Hidden Markov Models - see Zhou and Su, 2002), classification algorithms (e.g., support vector machines - see Li et al., 2005), or neural approaches (Collobert et al., 2011). In the past, classical machine learning models like SVM or logistic regression, strongly relying on feature engineering, were popular in NER (Goldberg, 2016). The features, generally belonging to one of three categories (document-, corpus-, and word-based features; Goyal et al., 2018), usually include, e.g., word length, capitalization, presence in an external list, part-of-speech, position in a sentence, the occurrence of a period or hyphen, suffixes, prefixes, or orthographic features (Zhou and Su, 2002; Tkachenko and Simanovsky, 2012).
Later, it was found that neural models (especially deep neural models), able to learn important features directly from texts, could also be used for NER. A prevalent approach is now based on neural networks with architectures such as bidirectional or convolutional LSTM (Lample et al., 2016; Chiu and Nichols, 2016; Rudra Murthy and Bhattacharyya, 2018; Chen et al., 2018). Such architectures, which are suitable for processing sequential data because they have a form of memory, are successfully used also in other natural language processing tasks (Mikolov et al., 2015). After a pioneering publication on word vector training using the word2vec algorithm (Mikolov et al., 2013a), NER research turned to using word vectors in many natural languages (Nguyen et al., 2019; El Bazi and Laachfoubi, 2019; Seok et al., 2016). Word vectors (Collobert et al., 2011), which are vectors representing individual words, are able to capture the syntactic as well as semantic regularities of a language, which has been found to be beneficial in many NLP tasks.
In order to learn word vectors using a neural model, texts need to be converted to a structured representation (vectors) first. The procedure can generally include several preprocessing steps like, e.g., text cleaning, white space removal, case folding, spelling error correction, abbreviation expansion, stemming, stop words removal, or negation handling (Dařena, 2019). For word embeddings training, some preprocessing can be applied too (Li et al., 2017; Leeuwenberg et al., 2016), which can have an impact on the context of the words, the number of unique words, and global word frequencies. Subsequently, one-hot encoded vectors (vectors where exactly one unit is 1 and all others are 0) that act as the inputs and outputs of the neural models are derived (Rong, 2014). Various sets of word embeddings trained on different corpora (e.g., Wikipedia) are readily available. Different algorithms can also be used to train one's own set of embeddings, suitable for general use or for a specific task. The algorithms have various parameters that need to be set with respect to a given task. Current approaches to NER using word embeddings, however, often use the default parameter settings, and the impact of alternative settings is not evaluated.
Besides the core features derived from the text in a neural model, additional language-dependent (presence in a list of cities, countries, first and last names, days of a week, currencies, part-of-speech, singular/plural) or language-independent features (context, position, word length, fixed-length prefix/suffix, presence of a hyphen) can be on the input of a NER system (Chiu and Nichols, 2016). Ševčíková et al. (2007) presented the first NER system for the Czech language, using decision trees analyzing handcrafted features to detect and classify entities in text. Kravalová and Žabokrtský (2009) implemented another system using SVM for classification. Král (2011) implemented a NER system for a specific purpose (searching the Czech press agency database) and demonstrated that feature selection plays a crucial role in designing a NER system. He showed that language independent features are more important than the dependent ones. Konkol and Konopík (2011) created a NER system using the Maximum Entropy algorithm, which used semantic spaces created with the COALS method (Rohde et al., 2004). It is the first work that treated words as vectors in a multidimensional space. Another system, employing Conditional Random Fields with different features and resources, was presented by Konkol and Konopík (2013). In the same year, Straková et al. (2013) published another NER system for Czech using a Maximum Entropy Markov Model.

NER for the Czech Language
The first system that employed word vectors trained using word2vec was presented by Demir and Özgür (2014). Although it used only language independent features, it outperformed all existing NER systems for Czech. A better performance was later achieved by Straková et al. (2016), who used a neural network with gated recurrent units together with word vectors representing original or lemmatized words, part-of-speech tags, prefixes, suffixes, or vector representations of characters. The word vectors in both systems were trained using word2vec with the skipgram architecture. The best performance was achieved by Konopík and Pražák (2018). They used a deep neural model with LSTM layers encoding character sequences and word sequences, together with wider context information obtained from Latent Dirichlet Allocation. The word sequence layer used pretrained GloVe and fastText word vectors.
The systems using word vectors were able to improve the performance expressed by the F1-measure by a few percent. At the same time, the features did not need to be engineered manually because many useful properties and relations were encoded in the vectors. Most of the systems, however, relied on pretrained word vectors or created the vectors using the default parameters of the algorithms.

Learning Word Embeddings
In machine learning, there is a general problem with choosing the right set of features for the given task (Blum and Langley, 1997). In natural language processing, there is an additional problem related to the classical representation of features derived from texts (known as bag-of-words). In this model, each word or other feature is represented by one dimension in a multidimensional space for representing the documents. Such a representation does not enable sharing information across features; each feature is thus independent of the others.
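This independence of bag-of-words features can be illustrated with a short sketch (the toy vocabulary and words below are purely illustrative): one-hot vectors of any two distinct words are orthogonal, so even near-synonyms share no similarity.

```python
# One-hot (bag-of-words style) vectors: every word gets its own dimension,
# so any two distinct words have zero similarity regardless of their meaning.
vocab = ["city", "town", "banana"]              # toy vocabulary (illustrative)
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# "city" and "town" are near-synonyms, yet their one-hot vectors are orthogonal:
print(dot(one_hot("city"), one_hot("town")))    # 0
print(dot(one_hot("city"), one_hot("banana")))  # 0
```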
To solve the problem with no similarity among features, it would be possible to add other information to the existing features to better capture the context in which they appear. This, however, increases the number of dimensions in the input space and requires the combination of possible feature components to be carefully selected (Goldberg, 2016).
Some of the modern representations of texts use more dimensions to represent each word or feature. The words are embedded in a continuous multidimensional space that has typically a few hundred dimensions so we talk about word embeddings. Finding suitable values of the vector elements is based on the hypothesis stating that words in similar contexts have similar meanings (Levy and Goldberg, 2014). Because similar words (e.g., synonyms) share some information, the values of their vector elements should be similar and the vectors are located close to each other in the multidimensional space.
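In a dense embedding space, by contrast, similarity between words can be measured directly, typically as the cosine of the angle between their vectors. A minimal sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

# Hypothetical embeddings: "city" and "town" point in similar directions,
# "banana" does not.
city = [0.9, 0.1, 0.3]
town = [0.8, 0.2, 0.35]
banana = [-0.1, 0.9, -0.4]

assert cosine(city, town) > cosine(city, banana)
```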
Popular approaches leading to generating such vectors include models using global matrix factorization like Latent Semantic Analysis (LSA) or Latent Dirichlet Allocation (LDA) and models learned by neural networks using a small context window (where word2vec is probably the most popular), see Mikolov et al. (2013a), Pennington et al. (2014). Supervised methods create embeddings that are trained towards the given goal and can capture information that is relevant for the task. They, however, require annotated data for the specific task. Unsupervised methods do not require annotated data. Their only goal is to compute embeddings that are usually learned in the task of predicting a word given its context or deciding, whether a word can belong to a context given examples of real and randomly created word-context pairs (Goldberg, 2016). Such embeddings capture general syntactic and semantic relationships and can be applied in a wide variety of tasks. When there is not enough data for domain-specific embeddings training available a model created on a general corpus can be adjusted using a smaller amount of domain-specific data (Yen et al., 2017).
Well-known methods that can be used to compute word embeddings include:
• word2vec - a family of methods proposed by Mikolov et al. (2013a) that strongly attracted the NLP community to neural language models. The method predicts a word based on its context (the Continuous Bag-of-Words or CBOW approach) or the context for a word (the skipgram approach). Word2vec tried to eliminate the problems with the computational complexity of the existing neural language models. In the training phase, the neural network uses a linear activation function instead of the sigmoid function typical for a multilayer perceptron, and the logarithm of the probability of predicting the word or its context is maximized.
• GloVe - a model that uses information about global co-occurrences of words. Word vectors are used in the task of predicting the probability with which two words co-occur. The probabilities can be calculated from a term-term matrix created from a corpus. The prediction is made by a function that takes word vectors as the input. The word vectors are calculated in the process of word co-occurrence matrix factorization using stochastic gradient descent (Pennington et al., 2014).
• fastText - a model derived from word2vec that treats each word as a bag of character n-grams (which enables considering sub-word information, especially important for morphologically rich languages), where the vectors are associated at the n-gram level. The vector for a word is calculated as the sum of its n-gram vectors. This enables, compared to word2vec and GloVe, creating vectors for words that are not in the training data (Bojanowski et al., 2017).
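The sub-word treatment used by fastText can be sketched as follows (a simplified illustration; the actual implementation additionally hashes the n-grams into a fixed number of buckets):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams with boundary markers, as in fastText.
    The whole word (with markers) is added as a special sequence."""
    marked = "<" + word + ">"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    grams.add(marked)
    return grams

# For n = 3 only, "where" yields the n-grams <wh, whe, her, ere, re>
# plus the special sequence <where>:
print(sorted(char_ngrams("where", min_n=3, max_n=3)))
```

The vector of a word is then the sum of the vectors of its n-grams, which is why a vector can be composed even for a word never seen during training.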
The skipgram technique in both word2vec and fastText algorithms can better capture semantic regularities of words. On the other hand, the CBOW approach captures syntactic regularities better (Mikolov et al., 2013b).
The methods of embeddings training require several parameters to be set -the number of vector dimensions, definition of the context (size and position), maximal number of unique words, minimal frequency of a word, number of training epochs, etc. which can significantly influence the quality of the learned vectors (Levy et al., 2015).
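The influence of one of these parameters, the context window size, can be illustrated by generating skipgram training pairs (a simplified sketch; real implementations additionally subsample frequent words and sample the effective window size randomly):

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs as produced for skipgram training."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "praha je hlavni mesto ceske republiky".split()
# A wider window yields more (and more loosely related) training pairs:
print(len(skipgram_pairs(sentence, window=1)))  # 10
print(len(skipgram_pairs(sentence, window=2)))  # 18
```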

DATA AND METHODS
The quality of word vectors depends on the corpus on which they are trained. Generally, the more data is available, the better. However, the number of unique tokens, the amount of errors, the writing style, the domain to which the texts are related, etc. also play a significant role in the NER task. What constitutes an entity and its type often depends on the domain (Kulkarni et al., 2016). The number of unique tokens is especially high for morphologically rich languages, where different forms of a word have an impact on the number of global occurrences as well as the number of combinations with other words. Here, normalization techniques like stemming, lemmatization, case folding, or stop words removal can be considered (Levy et al., 2015).

Data
For the Czech language, no corpora suitable for the NER task existed before 2007, when the Czech Named Entity Corpus (CNEC) 1.0 was released (Ševčíková et al., 2007). The corpus was later extended, simplified, and transformed to a format similar to the one used by the SIGNLL Conference on Computational Natural Language Learning (CoNLL), and evolved into the so-called Extended CNEC corpus (Konkol and Konopík, 2013). The corpus in version 2.0, defining the seven most commonly used entity types (numbers in addresses, geographical names, institution names, media names, artifact names, personal names, and time expressions), was used for training our NER system. This corpus has also been used by most of the other researchers, so a comparison with previous research is possible.
To evaluate the impact of corpus size and quality, which are the factors influencing the quality of word vectors (Levy et al., 2015) used in the NER system, different corpora were used to learn word vectors. CWC-2011 is a Czech corpus based on selected newspaper articles, blogs, and discussions on the Czech web (Spoustová and Spousta, 2012). CoNLL-2017 is a corpus released for the CoNLL conference in 2017. It also contains documents in Czech, especially from Wikipedia and other Internet sources (Zeman et al., 2018). CZES is a Czech corpus containing data from news websites from the years 1995-1998 and 2002 (Masaryk University, 2011). SYN-2015 is a part of the SYN corpus consisting of journalistic, technical, and fiction papers from the years 2010-2014 (Křen et al., 2016). EuroParl is a relatively small and specialized corpus containing texts related to the European parliament agenda in the years 1996-2011. Detailed characteristics of the data collections can be found in Tab. 1. The CoNLL-2017 corpus is the largest but, according to the number of unique tokens, probably of the lowest quality. The CWC-2011, CZES, and SYN-2015 corpora contain a lower number of unique tokens, so the tokens should appear with higher frequencies. The difference between these corpora is mainly in their size. The EuroParl corpus contains data from a specific domain and is relatively small. Because the CNEC and EuroParl corpora are rather small, the neural network implementing NER is allowed to update the word vectors (the vectors are trainable). Although this is usually not useful, updating word vectors with respect to a specific task might be a good option when the word vectors are not good enough (Hope et al., 2017). This fine-tuning for each task can also give an extra boost to the NER system performance (Collobert et al., 2011). When the other corpora are used to train word vectors, the vectors are fixed during the NER system training.
The Extended CNEC 2.0 corpus, which is primarily used to train the NER system, was used as one of the corpora for learning word vectors. The goal was to find out whether it is beneficial to compute word vectors from the corpus that is also used to train the system for the NER task when there is only a small corpus for training a NER system (with labeled named entities) available as it is expected that different NLP tasks employ the linguistic information related with other tasks (Güngör et al., 2018).
Texts from all sources were lowercased. The reason is that some of the available texts were already in lower case, so we wanted to have all of them in the same form. All non-alphanumeric characters and words with fewer than 5 occurrences were removed as well.
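The preprocessing described above can be sketched as follows (a simplified version; the exact tokenization used for the corpora is not stated, so whitespace splitting is assumed here):

```python
import re
from collections import Counter

def preprocess(text, min_count=5):
    """Lowercase, drop non-alphanumeric characters, and remove rare tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # keep letters, digits, whitespace
    tokens = text.split()
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

corpus = "Ahoj! Ahoj... ahoj, AHOJ? ahoj; svete."
print(preprocess(corpus, min_count=5))  # ['ahoj', 'ahoj', 'ahoj', 'ahoj', 'ahoj']
```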

The NER System
The NER system implemented to evaluate the impact of different properties of word vectors and the parameters of their learning was based on the work of Žukov-Gregorič et al. (2018), who achieved state-of-the-art results on the CoNLL NER dataset. The input to the system is a sequence of word vectors and the output is an entity type label (including a label for words that are not entities). The function mapping inputs (a sequence of word vectors) to outputs (a sequence of entity type labels) is a neural network.
The first layer of the network accepts word vectors and passes them to the hidden layer. The hidden layer uses bidirectional LSTM units. The output layer converts the signal from the hidden layer using hierarchical softmax to predict an entity type for the given input. As a stochastic gradient-based optimization algorithm, Adam (Kingma and Ba, 2014) was used to learn the weights of the network. Various hyperparameters of the network were determined experimentally, see below.
Initially, the NER system used pre-trained word vectors as input. The vectors were learned on the Czech part of the CoNLL-2017 collection with word2vec using the skipgram architecture, with context window of size 10, and word vectors having 100 dimensions (Fares et al., 2017).
The values of the hyperparameters (Feurer and Hutter, 2019) of the NER system can significantly influence the results. Because achieving the best possible NER result was not the main goal of the research, only an acceptably good setting was sought. Initially, the hyperparameters were set to the values typical for existing research (Žukov-Gregorič et al., 2018). The values were then changed (in both directions) as long as the performance of the NER system (measured by the F1-measure) kept improving.
The suitable hyperparameter values were found in the order in which they appear in the list below. The suitable number of epochs was determined as the average number of epochs needed to achieve the best result for the given combination of hyperparameters (this was 12 most of the time).
The best results were achieved with a hyperparameter setting that included, among others, the dropout probability (the probability that a neuron will be randomly turned off). With this setting, the system was able to achieve an F1-measure of 0.6816 on the CoNLL test set without optimizing the word vector creation process.
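The tuning procedure described above amounts to a greedy coordinate search: each hyperparameter value is moved in both directions for as long as the F1-measure keeps improving. A sketch with a hypothetical evaluation function (all scores below are made up for illustration):

```python
def tune(value, candidates, evaluate):
    """Greedy search over an ordered list of candidate values:
    starting from `value`, move left/right while the score improves."""
    best, best_score = value, evaluate(value)
    i = candidates.index(value)
    for step in (-1, 1):                # try both directions
        j = i + step
        score_prev = best_score
        while 0 <= j < len(candidates):
            score = evaluate(candidates[j])
            if score <= score_prev:
                break                   # stop once it stops improving
            if score > best_score:
                best, best_score = candidates[j], score
            score_prev = score
            j += step
    return best

# Hypothetical: F1 as a function of the hidden layer size, peaking at 200.
f1_by_size = {50: 0.62, 100: 0.66, 200: 0.68, 300: 0.67, 400: 0.64}
sizes = sorted(f1_by_size)
print(tune(100, sizes, f1_by_size.get))  # 200
```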

Changing Parameters During Word Vectors Training
There are a few aspects of word vectors training. They are evaluated in isolation in a sequence of experiments. In one phase, one parameter is investigated and its suitable value determined.
The following phase works with this value and focuses on another parameter. The corpora described in Section 3.1 were used to learn word vectors to evaluate the impact of different corpus sizes and quality. The word2vec algorithm using the skipgram architecture, a context window of size 10, hierarchical softmax, a minimal token frequency threshold, and the number of epochs equal to 5 was used to learn vectors with 100 dimensions. The skipgram architecture is suitable for most NLP tasks, is often used by other authors, and has low computational complexity (Levy et al., 2015).
Another important parameter is the size of the vectors. Generally, the bigger the vectors are, the more relations between words can be captured (Pennington et al., 2014). This was, however, demonstrated on the word analogy task and not on the NER task. The vector size is also related to the corpus size: the bigger the corpus is, the more words and relations it can contain. The experiments, therefore, examined different corpus and vector sizes. Most of the previous works used word vectors with 100 to 300 dimensions. We, therefore, examined 50, 100, 200, 300, and 400 dimensions, which covers the mentioned interval as well as close values outside it.
Some of the commonly used text preprocessing techniques, namely lemmatization, case folding, stop words removal, and their combinations, were applied to texts before learning word vectors (lemmatization and case folding must then be applied to the training data for the NER system too). Lemmatization and case folding, which belong to normalization techniques, decrease the number of unique tokens and increase the global frequencies of the tokens. This might be important especially for small corpora containing texts in a morphologically rich language like Czech.
Three algorithms, namely word2vec (using the CBOW and skipgram architectures), GloVe, and fastText (using the CBOW and skipgram architectures) were studied. In the experiments, different context window sizes (5, 10, and 15 words) at a fixed number of epochs (5) were examined. Subsequently, 1, 10, and 15 epochs (5 epochs were already included in the experiments with different context window sizes) of training using 10 words context window were applied to create word vectors.
For the best settings of the word2vec and fastText algorithms, the output layer function was changed from softmax to negative sampling with 5 or 10 negative samples. In the word2vec CBOW method, summation of the context vectors was used in addition to averaging. Different n-gram sizes were studied for the fastText algorithm. Similarly to Bojanowski et al. (2017), the minimal n-gram size was 2 or 3 and the maximal size 4 or 6. In the GloVe algorithm, different exponent values in the weighting function were used.
The following list summarizes the investigated aspects of individual techniques and algorithms during word vectors learning:
• all algorithms:
- corpus: different corpora from Tab. 1;
- number of dimensions: 50, 100, 200, 300, 400;
- context window size: 5, 10, 15;
- preprocessing techniques: lemmatization, case folding, stop words removal;
- number of training epochs: 1, 5, 10, 15;
• word2vec and fastText:
- architecture: skipgram or CBOW;
- last layer function: hierarchical softmax or negative sampling (5 or 10 negative samples);
• word2vec:
- CBOW aggregation: sum or average;
• fastText:
- character n-gram size: 2 to 6;
• GloVe:
- value of exponent α in the weighting function in the cost function: 0.75, 0.5, 0.25.

The quality of word vectors can be measured in several ways. One of the popular approaches is the analogy task (Mikolov et al., 2013b). However, good results in this task do not automatically lead to good results in the NER task. The performance of NER systems is usually measured using precision and recall. The precision is calculated as the ratio of the pieces of text that were correctly labeled as an entity to the number of pieces of text that were labeled as an entity. The recall is defined as the ratio of the entities in the text that were labeled as entities to the total number of entities in the text. These measures are calculated for each category of entities to be identified and can be further combined into the F1-measure, which is the harmonic mean of precision and recall (Yadav and Bethard, 2018). The impact of changing different parameters during the investigation was thus measured by the F1-measure.
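These evaluation measures can be sketched for a single entity category as follows (entity matching is simplified here to exact span-and-type matching):

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: sets of (start, end, type) entity spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

gold = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 6, "ORG"), (9, 10, "ORG")}   # one type error
p, r, f1 = precision_recall_f1(pred, gold)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.667 0.667 0.667
```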

RESULTS
This section provides the results from the investigation of the impact of different aspects of word vectors learning. The aspects follow the procedure described in Section 3.3.

Corpus Characteristics
First, the suitability of different corpora for creating word vectors was evaluated. The improvement against the baseline when no word vectors were used (the input contained just word identifiers) can be found in Tab. 2.
The CoNLL-2017, CZES, and CWC-2011 corpora brought the highest improvements of the F1-measure (more than 15%) in the NER task even without focusing on the optimization of the parameters of the algorithms used. Among these three, CoNLL-2017 brought the smallest improvement despite having the highest number of tokens. This means that not only the quantity but also the quality of the corpus is important. None of the corpora used for word vector training improved the NER outcomes beyond the result achieved when training the NER system with vectors pretrained on the Czech part of the CoNLL-2017 corpus, see Section 3.2 for details (the achieved F1-measure was 0.68). This means that it makes sense to focus on the details of the word vector training algorithms.
The best results were achieved with the CWC-2011 corpus (a large corpus with more than 600 million words) which is used in the following experiments.

Corpus Size and Word Vectors Length
The next experiment focused on determining how corpus size and word vector length are related and how they influence the results of NER. The outcome of this experiment is summarized in Tab. 3. The most significant differences can be seen between low- and high-dimensional vectors learned on the largest corpus, where smaller vectors were not sufficient for encoding all relations between words. Increasing the number of dimensions led to improvements even for smaller corpus portions. When the dimensionality reached around 300 or 400, the results stopped improving, or even degraded. Improvements were also positively related to corpus size. While the improvements between using 1% and 50% of the corpus were around 10% in the F1-measure, the differences between using 50% of the corpus and the whole corpus were marginal. Finding a suitable amount of data beyond which the results stop improving thus had a positive effect on computational complexity. For Czech, corpora containing hundreds of millions of tokens seem to be sufficient.
In the subsequent experiments, the number of dimensions was 300 because it enabled achieving the best results. Further experiments that did not change corpus size worked with 50% of texts from CWC-2011. Neither decision negatively influenced the performance of the NER system, and both had favorable computational complexity and memory demands.

Text Preprocessing Techniques
The application of normalization techniques and stop words removal led to a decreased number of unique tokens, which not only affected word vector training but also the number of out-of-vocabulary words (words that appear in the testing phase but are unknown from the training phase) in the NER system training. The effects of the application of these techniques and their combinations can be found in Tab. 4. Based on the results, it can be noted that normalization (here, the largest effect came from lemmatization) had a positive impact especially for smaller corpora used for word vector training. For larger text collections, especially with texts of higher quality, these techniques or their combinations were not useful. Of course, the same preprocessing techniques were applied to the texts used for the NER system training. Lowercasing is considered a baseline since all texts were lowercased for the initial experiments (an explanation is in Section 3.1).

Algorithms and Their Parameters
Until now, only the word2vec algorithm was used to train word vectors. In the following step, other algorithms and their parameters were examined. Two parameters were relevant for all three algorithms (word2vec, GloVe, and fastText): the context window size and the number of training epochs. The detailed results obtained for different context window sizes at a fixed number of training epochs, and for different numbers of training epochs at a fixed context window size, can be found in Tab. 5. It can be seen that fastText dominated in all these experiments. Also, the skipgram technique brought better results than CBOW for both word2vec and fastText, which means that semantic similarity is more important than the syntactic one (Mikolov et al., 2013b). The context window size had a larger impact on the results when the CBOW technique was used and only a negligible impact when using the skipgram technique. The number of training epochs seems to have no significant impact on the results. Five epochs of training led to the best results in most of the experiments. The GloVe algorithm did not reach results comparable to word2vec or fastText even when changing the exponent α in the weighting function in the cost function from the default value 0.75 to 0.5 or 0.25.
Both word2vec and fastText using skipgram can use different functions of the output layer of the neural network -hierarchical softmax or negative sampling. When using 5 or 10 negative samples, no improvement was observed for fastText. On the other hand, five negative samples increased the value of the F1-measure from 69.06% to 70.37% for word2vec. When using the CBOW technique, the sum instead of the average of the output vectors improved the value of the F1-measure from 64.58% to 66.62% for word2vec. For fastText, the value of the F1-measure even decreased. In both cases, the results were worse than when using the skipgram technique. When changing the minimal and maximal n-gram sizes for fastText (minimal = 2 or 3, maximal = 4 or 6), no improvements were found compared to the default setting (minimal = 3, maximal = 6).

Best Algorithms Settings
After a series of experiments focusing on different aspects of word vector preparation for the NER task, the following recommended settings have been identified:
• use the CWC-2011 corpus for training (of course, for a specific domain, other corpora might be more suitable; however, size and quality always need to be considered);
• train word vectors with 300 dimensions;
• use lemmatization when only a small amount of text is available for training (i.e., 1 or 10% of the corpus);
• word2vec settings: skipgram, context window size = 5, number of training epochs = 5, negative sampling with 5 negative samples;
• fastText settings: skipgram, context window size = 5, number of training epochs = 5, hierarchical softmax as the output layer function, minimal n-gram size = 3, maximal n-gram size = 6.
The results achieved with these recommendations are summarized in Tab. 6. We can see that lemmatization makes sense when only a small corpus is available for word vectors training. When more data is used, lemmatization even worsens the performance (see the 10% columns in Tab. 6). The size of the corpus used for training had the smallest impact on the performance when using the fastText algorithm. This supports (Bojanowski et al., 2017), who state that fastText is able to learn well on smaller corpora. From a certain size, using additional texts also did not bring further improvements in the NER task. When 50% or more of the corpus was used, word2vec was able to provide results comparable to fastText, while GloVe was about 5% behind.
Using fastText with the other recommended settings of both the process and the fastText algorithm is therefore the best approach for preparing word vectors for the NER task for the Czech language. The NER system using word vectors prepared in this way achieved an F1-measure of 72.47% (compared to 49.83% when not using word vectors and 68.16% when using the settings typically used by other authors).
The detailed performance for individual named entity categories can be found in Tab. 7. The best performance was achieved for temporal expressions (usually the names of days or months), which had very similar vector representations. On the other hand, artifact names, like units of measure, currencies, norms, or product names, were recognized with the worst performance as they cover a very wide variety of expressions. These names were often composed of several tokens, and the NER system did not always recognize all of them correctly.
Compared to word2vec, fastText was able to better recognize numbers and addresses (the F1-measure was 87.85% for fastText compared to 70.91% for word2vec). For word2vec, it is problematic to learn a separate word vector for every unique token. fastText, on the other hand, composes word vectors from character n-grams, so, for example, a vector for a number will be very similar to the vectors of other numbers. A similar situation occurs when recognizing media entities, which contain many e-mail addresses.
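The subword effect described above is easy to illustrate: fastText represents a word through the character n-grams of its boundary-padded form `<word>`, so tokens with similar surface forms, such as two numbers, share many n-grams. A minimal sketch with the default sizes (minimal = 3, maximal = 6); the Jaccard-style `overlap` measure is a hypothetical score added only for illustration:

```python
def char_ngrams(word, minn=3, maxn=6):
    # fastText-style subwords: the token is padded with boundary markers
    # before all character n-grams of length minn..maxn are extracted.
    padded = "<" + word + ">"
    return {padded[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(padded) - n + 1)}

def overlap(a, b):
    # Hypothetical Jaccard overlap of the two subword sets (illustration only).
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)

# Two year numbers share subwords such as "<20" and "201"; an unrelated
# word shares none, so its composed vector would be far less similar.
similar = overlap("2019", "2018")
unrelated = overlap("2019", "Praha")
```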

DISCUSSION
To evaluate the impact of different settings in word vectors training, the results were compared to the results of other authors who used a similar approach, i.e., a neural model for NER together with word vectors. The results also needed to be demonstrated on the Extended CNEC 2.0 corpus while using no additional resources (e.g., gazetteers, Brown mutual information bigram clusters, regular expressions) or engineered features (e.g., lemmas, prefixes, affixes, character n-grams, orthographic features). The availability of gazetteers or well-engineered features generally improves the NER system performance, so the results of the other authors against which the comparison is made are not the best they ever achieved. Not having exactly the same resources would also not allow a direct comparison of the results.
Demir and Özgür (2014) used a large Czech corpus containing 636 million words and 906 thousand unique tokens (a size similar to the CWC-2011 corpus), together with word2vec using the skipgram architecture and a context window size equal to 5, for training word vectors with 200 dimensions. Straková et al. (2016) used the same algorithm applied to the large SYN corpus for word vectors training. Tab. 8 summarizes the performance of the NER system presented in this paper and the results achieved by other authors where available. The first number, 49.83%, represents the value of the F1-measure achieved by our system when no word vectors were provided. When word vectors trained using the baseline method (see Section 4.1) were used and suitable values of the NER model hyperparameters were found, the value of the F1-measure increased to 68.16%. When focusing on the optimization of different aspects of word vectors training, the value further increased to 72.47%. It is obvious that focusing on the process of word vectors preparation can bring a significant improvement in the NER system performance. This is demonstrated by the comparison to the results of other researchers who did not focus on the optimization of word vectors training. Our results are compared to the outcomes achieved without additional language independent and dependent features and other modifications of the NER algorithm.

CONCLUSION
The research focused on named entity recognition in Czech, where the process of preparing the data for training a NER model using modern text representations had not been investigated. The main emphasis was put on the phase of preparing word vectors for training a machine learning-based NER system.
First, a NER system inspired by the state-of-the-art approach was created. The input to the system was a sequence of word vectors and the output was an entity type label. The function mapping inputs (a sequence of word vectors) to outputs (a sequence of entity type labels) was a neural network. The hidden layer used bidirectional LSTM units. The output layer converted the signal from the hidden layer using hierarchical softmax to predict an entity type. Adam, a stochastic gradient-based optimization algorithm, was used to learn the weights of the network.
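The forward pass of such a bidirectional LSTM layer can be sketched as follows. This is a minimal pure-Python illustration of the architecture (biases omitted, random weights, toy dimensions), not the trained model from the paper:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x, h, c, p):
    # One LSTM step over the concatenated [input, previous hidden] vector;
    # p holds one weight matrix per gate (biases omitted for brevity).
    xh = x + h
    i = [sigmoid(a) for a in matvec(p["Wi"], xh)]    # input gate
    f = [sigmoid(a) for a in matvec(p["Wf"], xh)]    # forget gate
    o = [sigmoid(a) for a in matvec(p["Wo"], xh)]    # output gate
    g = [math.tanh(a) for a in matvec(p["Wg"], xh)]  # candidate cell state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def bilstm(seq, p_fwd, p_bwd, H):
    def run(xs, p):
        h, c, out = [0.0] * H, [0.0] * H, []
        for x in xs:
            h, c = lstm_step(x, h, c, p)
            out.append(h)
        return out
    fwd = run(seq, p_fwd)
    bwd = list(reversed(run(list(reversed(seq)), p_bwd)))
    # Each token is represented by its forward and backward hidden states,
    # which the output layer would then map to entity type scores.
    return [hf + hb for hf, hb in zip(fwd, bwd)]

def random_params(D, H, rng):
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(D + H)] for _ in range(H)]
    return {"Wi": mat(), "Wf": mat(), "Wo": mat(), "Wg": mat()}

rng = random.Random(0)
D, H = 4, 3                       # toy word vector and hidden sizes
seq = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(5)]
outputs = bilstm(seq, random_params(D, H, rng), random_params(D, H, rng), H)
```

Each position in `outputs` carries context from both directions of the sentence, which is what lets the model label a token using words that appear after it.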
Subsequently, attention was paid to various aspects of word vectors preparation. The suitable settings in different steps were determined in extensive experiments and included the used corpus, the number of word vectors dimensions, the used text preprocessing techniques, the context window size and number of training epochs for word vectors training, and other algorithm-specific parameters. Besides suitable values for the parameters, it has been found that a sufficiently large corpus of good quality needs to be used and that the number of word vectors dimensions needs to be chosen so that enough relations between words can be encoded.
It was demonstrated that focusing on the process of word vectors preparation can bring a significant improvement of the NER system performance even without using additional language independent and dependent resources.