CRF (Conditional Random Fields) was a popular supervised learning method before deep learning took off, and it remains an easy-to-use and robust machine learning algorithm. We recently used it for NER (named entity recognition), and here is a brief summary of using CRF in Python.
Among CRF toolkits, CRF++ and CRFsuite are the most popular choices. However, CRFsuite is more robust and faster to train. We were told that CRF++ requires the features to be specified in files, whereas CRFsuite can compute features during training. Therefore, we chose CRFsuite as the framework.
Several Python libraries wrap CRFsuite, including python-crfsuite and sklearn-crfsuite. We chose the latter because of its comprehensive tutorial.
The sklearn-crfsuite tutorial can be found on GitHub. It is easy to follow; nevertheless, the code is not production quality, so we made a number of modifications.
Note that sklearn-crfsuite does not accept a pandas DataFrame as feature input.
The sklearn-crfsuite tutorial puts all features in a single function, which makes the feature set very difficult to configure in a test environment.
To fix this problem, we split it into several individual functions.
We use the function load_yaml_conf to read the feature configuration from a YAML file, and the function feature_selector to convert that configuration into the feature dictionary.
Here is a sample YAML configuration file. ‘current’ and ‘previous’ are conf_switches.
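Since the original listing is not reproduced here, the following is only a minimal sketch of how such a configuration could be loaded and turned into a feature dictionary. The feature names, the FEATURE_FUNCTIONS registry, and both function signatures are our assumptions for illustration, not the exact code we ran. PyYAML is imported lazily so that the rest of the sketch works without it.

```python
# A hypothetical YAML file with two switches might look like this:
#
#   current:
#     - lower
#     - is_title
#     - suffix3
#   previous:
#     - lower
#     - is_title
#
# yaml.safe_load would turn it into the dictionary below.
SAMPLE_CONF = {
    "current": ["lower", "is_title", "suffix3"],
    "previous": ["lower", "is_title"],
}

# Hypothetical registry: feature name -> function of a token string.
FEATURE_FUNCTIONS = {
    "lower": str.lower,
    "is_title": str.istitle,
    "is_digit": str.isdigit,
    "suffix3": lambda token: token[-3:],
}


def load_yaml_conf(path):
    """Read the feature configuration from a YAML file (needs PyYAML)."""
    import yaml  # third-party; imported here so the sketch loads without it

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


def feature_selector(conf, conf_switch):
    """Return {feature_name: function} for the features enabled under one switch."""
    names = conf.get(conf_switch) or []
    unknown = [name for name in names if name not in FEATURE_FUNCTIONS]
    if unknown:
        raise KeyError(f"Unknown feature names in config: {unknown}")
    return {name: FEATURE_FUNCTIONS[name] for name in names}


current_features = feature_selector(SAMPLE_CONF, "current")
```

A switch that is missing from the configuration simply yields an empty dictionary, so a position can be turned off by deleting its block from the file.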
Then we use another function to calculate the features of the current token and of its neighbours, which are the most important part of a CRF:
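As a rough illustration of this step, the sketch below builds the feature dictionary for one token from a parsed configuration. The CONF contents, the OFFSETS table, and the name token_features are hypothetical; the bias, BOS, and EOS features follow the convention used in the sklearn-crfsuite tutorial.

```python
# Hypothetical registry: feature name -> function of a token string.
FEATURE_FUNCTIONS = {
    "lower": str.lower,
    "is_title": str.istitle,
    "is_digit": str.isdigit,
}

# Hypothetical parsed configuration: switch name -> enabled feature names.
CONF = {
    "current": ["lower", "is_title"],
    "previous": ["lower"],
    "next": ["lower"],
}

# Position of each switch relative to the current token.
OFFSETS = {"previous": -1, "current": 0, "next": 1}


def token_features(tokens, i, conf=CONF):
    """Build the feature dict for tokens[i], including enabled neighbour features."""
    features = {"bias": 1.0}
    for switch, offset in OFFSETS.items():
        names = conf.get(switch) or []
        if not names:
            continue  # this position is switched off in the configuration
        j = i + offset
        if j < 0:
            features["BOS"] = True  # no previous token: beginning of sentence
            continue
        if j >= len(tokens):
            features["EOS"] = True  # no next token: end of sentence
            continue
        for name in names:
            features[f"{switch}:{name}"] = FEATURE_FUNCTIONS[name](tokens[j])
    return features


sentence = ["Anna", "visited", "Paris"]
first = token_features(sentence, 0)
last = token_features(sentence, 2)
```

Prefixing each feature name with its switch keeps the features of the current token and of its neighbours apart in the resulting dictionary.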
In this way, users can easily change the feature set in the configuration without changing the script.
Adding extra features
One trick to boost the performance of CRF is to add extra feature dictionaries.
Say we want to label POS tags with a CRF: we can add a noun suffix dictionary, since, for example, ‘tion’ is a typical noun suffix. We therefore use several functions to add features from external dictionaries. In this way, we can add features during the calculation instead of inputting all features from files. This is the most noticeable difference between CRFsuite and CRF++.
Function add_one_features_list adds features from a list file, and function add_one_features_dict adds features from a key-value-pair file.
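The two helpers could look roughly like the sketch below. To keep it short, it works on entries that are already loaded into memory rather than reading the files, and it assumes each token is a dictionary with a "word" key. Both the signatures and that key are our assumptions for illustration.

```python
def add_one_features_list(sentence, feature_name, entries):
    """Flag every token by whether its lower-cased word occurs in `entries`."""
    entries = set(entries)
    return [
        {**token, feature_name: token["word"].lower() in entries}
        for token in sentence
    ]


def add_one_features_dict(sentence, feature_name, mapping, default="none"):
    """Attach the value that `mapping` holds for each word, or `default`."""
    return [
        {**token, feature_name: mapping.get(token["word"].lower(), default)}
        for token in sentence
    ]


sentence = [{"word": "Station"}, {"word": "runs"}]
with_list = add_one_features_list(sentence, "in_noun_list", ["station", "nation"])
with_dict = add_one_features_dict(sentence, "lexicon_pos", {"station": "NOUN"})
```

Both helpers return new dictionaries instead of modifying the input, so they can be chained to add several external features one after another.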
Please note that both functions above use a conditional expression inside a list/dict comprehension. The more familiar form has only an if clause, which filters items; here we use an if ... else ... expression, which chooses a value for every item. The order differs between the two: a filtering if comes after the for clause, whereas an if ... else ... expression comes before it.
With these two functions, we can easily add multiple external features at once.
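A small standalone example makes the difference in order visible. The token list is made up for illustration.

```python
tokens = ["nation", "run", "action", "go"]

# Filtering: the `if` comes AFTER the `for` clause and drops non-matching items.
nouns_only = [t for t in tokens if t.endswith("tion")]

# Conditional expression: `a if cond else b` comes BEFORE the `for` clause
# and keeps every item, choosing one of two values for each.
flags = ["noun_suffix" if t.endswith("tion") else "other" for t in tokens]

# The same conditional expression works as the value of a dict comprehension.
suffix_by_token = {t: ("tion" if t.endswith("tion") else None) for t in tokens}
```

The filtering form changes the length of the result, while the conditional expression always returns one value per input item, which is what a per-token feature needs.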
Feeding data to CRF trainer
The next step is to feed the text data with the added features to the CRF trainer. Each token is converted to a dictionary and each sentence to a list, so a piece of text becomes a nested list: a list of sentences, each of which is a list of token dictionaries. The features and the labels therefore need to be extracted separately.
The first function below extracts features of each token, while the second one extracts labels of each token.
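A minimal sketch of the two extraction functions follows. It assumes each token dictionary stores its gold label under a "label" key next to its features; that layout and the names sent2features and sent2labels (borrowed from the sklearn-crfsuite tutorial) are assumptions for illustration.

```python
def sent2features(sentence, label_key="label"):
    """Return one feature dict per token, with the label removed."""
    return [
        {key: value for key, value in token.items() if key != label_key}
        for token in sentence
    ]


def sent2labels(sentence, label_key="label"):
    """Return the label of every token in the sentence."""
    return [token[label_key] for token in sentence]


# A tiny made-up corpus: a list of sentences, each a list of token dicts.
corpus = [
    [
        {"lower": "anna", "is_title": True, "label": "B-PER"},
        {"lower": "sings", "is_title": False, "label": "O"},
    ],
    [{"lower": "paris", "is_title": True, "label": "B-LOC"}],
]

X = [sent2features(sentence) for sentence in corpus]
y = [sent2labels(sentence) for sentence in corpus]
```

X and y keep the same nesting, one inner list per sentence, which is the shape the CRF trainer expects.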
Setting parameters of CRF algorithm
CRF is an umbrella term for a family of models. For the NER task, which is basically a sequence prediction task, the linear-chain CRF is the most suitable, and this is the model CRFsuite implements. We still need to set the training algorithm in CRFsuite. Here, we choose lbfgs (L-BFGS, limited-memory Broyden-Fletcher-Goldfarb-Shanno), a gradient-based optimizer, and sklearn-crfsuite will take care of the rest.
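The setup could be sketched as follows. The parameter values are the ones used in the sklearn-crfsuite tutorial, not necessarily the ones we ended up with, and the helper train_crf is a hypothetical name. The library is imported inside the function so that the parameter dictionary can be inspected without it.

```python
# Parameter values taken from the sklearn-crfsuite tutorial (an assumption,
# not tuned settings).
CRF_PARAMS = {
    "algorithm": "lbfgs",              # L-BFGS training algorithm
    "c1": 0.1,                         # L1 regularisation coefficient
    "c2": 0.1,                         # L2 regularisation coefficient
    "max_iterations": 100,
    "all_possible_transitions": True,  # also learn transitions unseen in training
}


def train_crf(X_train, y_train, params=None):
    """Fit a linear-chain CRF on nested feature and label lists."""
    import sklearn_crfsuite  # third-party; imported lazily

    crf = sklearn_crfsuite.CRF(**(params or CRF_PARAMS))
    crf.fit(X_train, y_train)
    return crf
```

Calling train_crf(X, y) with the nested feature and label lists returns the fitted model, whose predict method yields one label list per sentence.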
Testing CRF result
To calculate the F1 score of the trained CRF, we can use the function metrics.flat_f1_score from sklearn-crfsuite.
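To show what "flat" means here, the sketch below flattens the nested label lists and computes the F1 score of a single label by hand. It is a plain-Python illustration of the idea, not the library's implementation, and flat_label_f1 is a hypothetical name. In practice flat_f1_score does the flattening and delegates the averaging over labels to scikit-learn.

```python
from itertools import chain


def flat_label_f1(y_true, y_pred, label):
    """F1 of one label after flattening sentence-level lists into token lists."""
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold and predicted labels for two sentences.
y_true = [["B-PER", "O", "B-LOC"], ["O", "B-PER"]]
y_pred = [["B-PER", "O", "O"], ["B-PER", "B-PER"]]

per_f1 = flat_label_f1(y_true, y_pred, "B-PER")
loc_f1 = flat_label_f1(y_true, y_pred, "B-LOC")
```

Sentence boundaries play no role in this score: every token counts as one independent classification.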
The above code evaluates the classification result at the word level; however, to evaluate a sequence-labelling task, we need a more comprehensive method that works at the sequence level. For example, in an NER task, we would like to know how many entity sequences are annotated correctly as a whole. Therefore, we created a function to achieve this goal.
sklearn-crfsuite is a very easy-to-use package for applying CRF algorithms, and this post summarizes some key steps of using it in an NER task.
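One way to implement such a sequence-level evaluation is sketched below. It assumes BIO tags (B-X opens an entity, I-X continues it, O is outside) and counts an entity as correct only when its type and both boundaries match. The function names and the tagging scheme are assumptions for illustration, not our original function.

```python
def extract_entities(tags):
    """Return the set of (entity_type, start, end) spans in one BIO sequence."""
    entities = set()
    start, ent_type = None, None
    # A trailing "O" sentinel closes an entity that runs to the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not continues:
            entities.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        # An I- tag with no open entity of the same type is ignored.
    return entities


def entity_level_scores(y_true, y_pred):
    """Precision, recall and F1 over whole entities instead of single tokens."""
    tp = n_true = n_pred = 0
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_entities = extract_entities(true_tags)
        pred_entities = extract_entities(pred_tags)
        tp += len(true_entities & pred_entities)
        n_true += len(true_entities)
        n_pred += len(pred_entities)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


y_true = [["B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["B-PER", "O", "O", "B-LOC"]]
scores = entity_level_scores(y_true, y_pred)
```

In this made-up example three of the four tokens are labelled correctly, yet only one of the two entities is recovered in full, which is why the sequence-level score is the stricter one.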