Text mining is one of the prospering areas in data science that allows data scientists to work with textual content. However, some common text-mining practices, such as stopword removal and stemming, do not apply directly to Chinese text because of differences in language structure.
On the other hand, a study from Internet World Stats showed that Chinese-language Internet users accounted for 23.2% of the world's Internet users (as of December 31, 2013), making them the second-largest group of users (native English users are the largest group at 28.6%). There is no doubt that the business world has a strong demand for text-mining skills for Chinese text. It is important to provide the knowledge and tools needed to extend data scientists' text-mining capabilities to include Chinese text content.
What am I going to get from this course?
- Know the basics of Chinese text structures: characters, vocabulary types, sentences
- Understand how Chinese text is represented on computers, including encodings and conventions: Unicode, GB, HZ, Big5
- Understand the theory of Chinese text segmentation and apply it using the Jieba library
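As a small preview of the last two topics above, the following Python sketch uses only the standard library. It is illustrative rather than course material: the sample string, the toy vocabulary, and the helper names `encoded_sizes` and `forward_max_match` are all assumptions made for this example. The segmenter is a simple forward maximum matching baseline, not Jieba's own algorithm, and is shown only to make the idea of dictionary-based segmentation concrete.

```python
# Hypothetical sample text ("Chinese text mining"); chosen for illustration only.
SAMPLE = "中文文本挖掘"


def encoded_sizes(text, encodings=("utf-8", "gb2312", "big5", "hz")):
    """Return {encoding name: number of bytes} for the same text.

    All four codec names ship with CPython's standard library.
    """
    return {enc: len(text.encode(enc)) for enc in encodings}


def round_trips(text, encoding):
    """Check that encoding and then decoding returns the original text."""
    return text.encode(encoding).decode(encoding) == text


def forward_max_match(text, vocab, max_len=4):
    """Segment text greedily, always taking the longest dictionary word.

    This is a toy baseline: Chinese has no spaces between words, so we scan
    left to right and try the longest candidate first. A single character
    is emitted as a fallback when no dictionary word matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # The same two characters take a different number of bytes per encoding.
    print(encoded_sizes("中文"))

    # Toy vocabulary (an assumption); real segmenters use large dictionaries.
    vocab = {"中文", "文本", "挖掘"}
    print(forward_max_match(SAMPLE, vocab))
```

Splitting on whitespace, the usual first step for English text, would return the whole sample as a single token, which is why a segmentation step is needed before most downstream text-mining techniques can be applied.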
Prerequisites and Target Audience
What will students need to know or do before starting this course?
- Basic knowledge of Python development
- Basic knowledge of text mining
- Knowledge of machine learning and statistics
- Interest in learning to apply their data science skills to Chinese text documents
Who should take this course? Who should not?
This course targets data scientists who are working on natural language processing
and would like to extend their work to textual content in Chinese. Students are assumed
to have basic knowledge of Python and text mining. Knowledge of the Chinese
language is not required, but an interest in it will make the course easier.
Module 1: Introduction to the Basic Structures of Chinese
Course Overview and Objectives
Traditional and Simplified Chinese
History of Chinese Characters
Module 2: Deep Dive into Text Segmentation
Chinese Text Simulation
Jieba Part-of-Speech Tagging
Chinese NLP in Action
Jieba Text Segmentation