Chinese Text Preprocessing with Stanford NLP Software

This article has been republished by cos.name: http://cos.name/2016/01/intro-to-chinese-nlp/

Converting Between the Two Forms of Chinese Text with OpenCC

Datasets often mix the two forms of written Chinese: Simplified Chinese, used in mainland China, and Traditional Chinese, used in other regions. Ignoring the mix causes problems in later processing, because the same word written in the two forms is treated as two different strings. To avoid this, we convert all text to a single form with OpenCC by BYVoid.
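To make the normalization step concrete, here is a minimal sketch of the idea in plain Python. It does not call OpenCC: the table `TRAD_TO_SIMP` and the helpers `to_simplified` and `has_traditional` are hypothetical names made up for illustration, and the table covers only a handful of characters. OpenCC itself ships full dictionaries and also handles phrase-level differences, which a character table like this cannot do.

```python
# Toy stand-in for a Traditional-to-Simplified converter.
# TRAD_TO_SIMP is a hypothetical, hand-made table with only a few
# characters; a real converter such as OpenCC uses full dictionaries.
TRAD_TO_SIMP = str.maketrans({
    "漢": "汉",
    "語": "语",
    "學": "学",
    "習": "习",
    "體": "体",
})


def to_simplified(text: str) -> str:
    """Replace every character found in the toy table; leave the rest as-is."""
    return text.translate(TRAD_TO_SIMP)


def has_traditional(text: str) -> bool:
    """Return True if the text contains any character listed in the toy table."""
    return any(ord(ch) in TRAD_TO_SIMP for ch in text)


# A tiny mixed corpus: the same sentence in Traditional and Simplified form.
corpus = ["我在學習漢語", "我在学习汉语"]

# After normalization both lines are identical strings, so later steps
# (segmentation, counting, matching) see one form only.
normalized = [to_simplified(line) for line in corpus]
```

In practice the toy table would be replaced by a call into OpenCC, while the surrounding flow (normalize every line before any further processing) stays the same.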

General Pipelines for Chinese NLP Engineering with Stanford NLP Software

The Chinese version of this article can be found here.