Course Description
Text mining is one of the fastest-growing areas of data science, and it lets data scientists work with textual content. However, some common text-mining practices, such as stopwords and stemming, are not applicable to Chinese texts because of differences in language structure.
On the other hand, a study from Internet World Stats showed that Chinese-language Internet users accounted for 23.2% of the world's Internet users (as of December 31, 2013), making them the second-largest group of users (native English users are the largest group at 28.6%). The business world undoubtedly has a strong demand for text-mining skills for Chinese texts, so it is important to provide the knowledge and tools needed to extend data scientists' text-mining capacity to Chinese text content.
What am I going to get from this course?
- Know the basics of Chinese text structures: characters, vocabulary types, sentences
- Understand how Chinese text is represented on computers, including encodings and conventions: Unicode, GB, HZ, Big5
- Understand the theory of Chinese text segmentation and apply Chinese segmentation using the Jieba library
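
To give a flavour of the last two outcomes, here is a minimal Python sketch. It is illustrative only and is not taken from the course material. The first helper encodes the same string with several codecs from the Python standard library (UTF-8 as the Unicode encoding, GB2312 as one member of the GB family, plus Big5 and HZ). The second is a toy forward-maximum-matching segmenter, a classic dictionary-based technique; it is not the algorithm Jieba uses, and the function names, the tiny vocabulary, and the `max_len` parameter are all assumptions made for this example.

```python
def encode_all(text, codecs=("utf-8", "gb2312", "big5", "hz")):
    """Return a dict mapping each codec name to the encoded bytes of `text`."""
    return {name: text.encode(name) for name in codecs}


def forward_max_match(text, vocab, max_len=4):
    """Toy segmenter: at each position, take the longest word found in `vocab`.

    Falls back to a single character when no dictionary word matches,
    so the loop always advances. Hypothetical helper, not Jieba's method.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # The same two characters occupy different byte sequences per encoding.
    for name, raw in encode_all("中文").items():
        print(name, len(raw), raw.hex())

    # A made-up vocabulary for illustration; real dictionaries are far larger.
    toy_vocab = {"自然", "语言", "自然语言", "处理"}
    print(forward_max_match("我爱自然语言处理", toy_vocab))
```

With the Jieba package installed, the equivalent one-line call would be `jieba.lcut(text)`, which returns a list of tokens. Its results come from its own dictionary and statistical model, so they may differ from this toy matcher.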
Prerequisites and Target Audience
What will students need to know or do before starting this course?
- Basic knowledge of Python development
- Basic knowledge of text mining
- Knowledge of machine learning and statistics
- Interest in learning to apply their data science skills to Chinese text documents
Who should take this course? Who should not?
This course targets data scientists who are working on natural language processing
and would like to extend their work to textual content in Chinese. Students are
assumed to have basic knowledge of Python and text mining. Knowledge of the Chinese
language is not required, but an interest in it will make the course easier.
Curriculum
Module 1: Introduction to the Basic Structures of Chinese
40:10
Lecture 1
Course Overview and Objectives
02:20
Lecture 2
Chinese Grammar
04:21
Lecture 3
Traditional and Simplified Chinese
07:32
Lecture 4
Jian-Fan Conversion
09:31
Lecture 5
Chinese Vocabulary
06:31
Lecture 6
Chinese Pinyin
05:35
Lecture 7
History of Chinese Characters
04:20
Module 2: Deep Dive into Text Segmentation
36:24
Lecture 8
Chinese Text Simulation
08:13
Lecture 9
Jieba Part-of-Speech Tagging
09:03
Lecture 10
Chinese NLP in Action
11:37
Lecture 11
Jieba Text Segmentation
07:31