In our recent projects, we gave a try of spaCy
, an under-developing Python NLP library. According to the author:
spaCy
excels at large-scale information extraction tasks. It’s written from the ground up in carefully memory-managedCython
. Independent research has confirmed thatspaCy
is the fastest in the world. If your application needs to process entire web dumps, spaCy is the library you want to be using.
Considering its speed and performance, we used it in some low-level NLP tasks, such as chunking and POS-tagging. Also, we tried spaCy
in the NER task, but, because of being a black-box, we did not use it in our production.
The exact algorithm is a pastiche of well-known methods, and is not currently described in any single publication.
Using spaCy for low-level NLP tasks
spaCy
’s official site provides a comprehensive guide to use its workflows, but in reality, we need a more flexible and robust way to use these workflows. Therefore, we can wrap them together for low-level NLP tasks.
In this function, we use two dictionaries to compile all choices together.
In the first dictionary spacy_dic
, we read specific result out from the spaCy
processing result spacy_result
object. Each result is wrapped by a comprehension, either a list comprehension or a generator comprehension. Respectively, spacy_dic['chk']
returns the sentence chunks, spacy_dic['txt']
returns the original text, spacy_dic['pos']
returns part-of-speech tagging result, and spacy_dic['dep']
returns dependency parsing result.
spaCy token API page provides all avaiable APIs, so we can add any other avaiable APIs to this dictionary.
We then use the second dictionary result
to assemble the result from the first dictionary spacy_dic
, in order to avoid unnecessary repeated computing.
The last step is to use dict
’s default key-get method to achieve the result from above two dictionaries.
Using dictionary to handle multi-choice logic
Using a dictionary structure is an elegent way to handle a complicated if..elif..else
logic.
If we do not use a dictionary strucutre here, we may use a tradiaional way to deal with this logic like this:
Which one do you like? Definitely I would choose the first one.
Final thought
spaCy
indeed provides an efficient and extensive library for NLP tasks. If you have not given a try, it is worth taking a look at it.