2021-Fall-DSC180A02-Capstone: Text Mining and NLP
Undergraduate Class, HDSI, UCSD, 2021
Class Time: Wednesdays, 1 to 1:50 PM Pacific Time. Room: https://ucsd.zoom.us/j/91491702947.
Overview
This capstone section mainly focuses on text mining and natural language processing. We will explore cutting-edge research papers in these areas together and try to replicate some experiments for a deeper, better understanding.
Instead of traditional lectures, we will mostly hold discussions in a Q&A format. Due to COVID-19, the discussions will take place online over Zoom.
Papers to Read
Mining Quality Phrases from Massive Text Corpora
Jialu Liu*, Jingbo Shang*, Chi Wang, Xiang Ren and Jiawei Han. SIGMOD 2015. [code]
Automated Phrase Mining from Massive Text Corpora
Jingbo Shang, Jialu Liu, Meng Jiang, Xiang Ren, Clare R Voss and Jiawei Han. TKDE 2018. [code]
UCPhrase: Unsupervised Context-aware Quality Phrase Tagging
Xiaotao Gu*, Zihan Wang*, Zhenyu Bi, Yu Meng, Liyuan Liu, Jiawei Han and Jingbo Shang. KDD 2021. [arXiv:2105.14078] [code]
Tips
- These three papers are highly related. Please read them one by one.
- The GitHub README files also provide useful information.
- PDFs of these papers can be found online or here
Schedule
Week | Date | Discussion Focus |
1 | 09/29 | General Overview (a short lecture by Jingbo Shang) |
2 | 10/06 | Introduction & Motivation |
3 | 10/13 | Datasets and Experiment Design |
4 | 10/20 | Experimental Results - Analysis |
5 | 10/27 | Experimental Results - Replication |
6 | 11/03 | Case Studies |
7 | 11/10 | Application Brainstorming |
8 | 11/17 | Possible Extension |
9 | 11/24 | Report Writing Discussion |
10 | 12/01 | Elevator Pitch |
Discussion Questions
Week 2: Introduction & Motivation
- Why do we want to study phrase mining? What’s the advantage of phrases over unigrams?
- What’s the major problem when someone applies SegPhrase to a new corpus? Is any human effort required?
- What’s the motivation of AutoPhrase? Compared with SegPhrase, which parts do you believe are novel?
- What’s the motivation of UCPhrase? Compared with AutoPhrase and SegPhrase, what are the major innovations in UCPhrase?
Week 3: Datasets and Experiment Design
- How many datasets are used in the papers? How many domains and languages are covered?
- Why do we want to use such a diverse set of datasets? How is this related to the claims in the papers?
- Why do we want to evaluate the results following the pooling strategy? Think about how much human effort would be required if we did not use pooling.
- Why does UCPhrase use some evaluation settings that differ from those of AutoPhrase and SegPhrase?
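To make the pooling question concrete, here is a minimal sketch of the general pooling idea: only the union of each method's top-k phrases is sent to human annotators, rather than every candidate phrase. The function names, the toy phrase lists, and the exact sampling and precision definitions are illustrative assumptions, not the protocol of any of the three papers; please check each paper for its actual settings.

```python
import random


def build_pool(ranked_lists, k):
    """Union of the top-k phrases from each method's ranked list."""
    pool = set()
    for phrases in ranked_lists.values():
        pool.update(phrases[:k])
    return pool


def sample_for_annotation(pool, n, seed=0):
    """Draw at most n phrases from the pool for human labeling."""
    rng = random.Random(seed)
    # Sort first so the sample is reproducible for a given seed.
    return rng.sample(sorted(pool), min(n, len(pool)))


def precision_at_k(phrases, labels, k):
    """Precision over the top-k phrases that actually received a label."""
    judged = [p for p in phrases[:k] if p in labels]
    if not judged:
        return 0.0
    return sum(1 for p in judged if labels[p]) / len(judged)


# Toy ranked outputs from two hypothetical methods.
ranked_lists = {
    "method_a": ["data mining", "neural network", "in this paper"],
    "method_b": ["neural network", "support vector machine", "we propose"],
}
pool = build_pool(ranked_lists, k=2)
to_label = sample_for_annotation(pool, n=2)
```

The annotation cost grows with the size of the pool instead of the total number of candidate phrases, which is the human-effort trade-off the question above asks about.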
Week 4: Experimental Results - Analysis
- Please outline the claims in these three papers.
- How can we understand each table and figure? What are the takeaways? One or two sentences per table/figure should be enough.
- For each claim, which experimental results support it?
Week 5: Experimental Results - Replication
- Carefully check the README file in the AutoPhrase repo. What is the relation between autophrase.sh and phrasal_segmentation.sh?
- Try to run AutoPhrase using the DBLP.5k.txt and DBLP.txt datasets as the input corpus. It should be runnable on your laptop. Let me know if you encounter any issues.
- Please eyeball the results from the two runs and try to compare them from the following aspects:
  - The number of high-quality phrases (e.g., > 0.5)
  - Unigram phrases vs. multi-word phrases
  - The top few high-quality phrases (e.g., > 0.9) vs. borderline phrases (e.g., ~0.5)
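If eyeballing gets tedious, a small script can tabulate the three aspects above. The sketch below assumes that each line of the ranked phrase list is a quality score, a tab, and then the phrase; please verify this against the README and your own output files before relying on it. The function names, the 0.05 borderline band, and the sample lines are illustrative assumptions, not real AutoPhrase results.

```python
def parse_ranked_phrases(lines):
    """Parse 'score<TAB>phrase' lines into (score, phrase) pairs."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        score_text, phrase = line.split("\t", 1)
        pairs.append((float(score_text), phrase.strip()))
    return pairs


def summarize(pairs, high=0.5, top=0.9, band=0.05):
    """Count high-quality phrases and collect top and borderline ones."""
    high_quality = [(s, p) for s, p in pairs if s > high]
    unigrams = [p for s, p in high_quality if len(p.split()) == 1]
    return {
        "high_quality": len(high_quality),
        "unigram": len(unigrams),
        "multi_word": len(high_quality) - len(unigrams),
        "top": [p for s, p in pairs if s > top],
        "borderline": [p for s, p in pairs if abs(s - high) <= band],
    }


# Made-up sample lines in the assumed format (not actual results).
sample_lines = [
    "0.95\tsupport vector machine\n",
    "0.92\tclassification\n",
    "0.52\tdata set\n",
    "0.48\tproposed method\n",
    "0.10\tthe\n",
]
summary = summarize(parse_ranked_phrases(sample_lines))
```

To use it on a real run, pass an open file handle for each run's ranked phrase list to parse_ranked_phrases and compare the two summaries side by side.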
Week 6: Case Studies
- Why do we need case studies in addition to the quantitative results?
- How do case studies further the claims in the papers?
- Do you have any interesting findings from either the case studies presented in the papers or the results you got from Week 5?
Week 7: Application Brainstorming
- What kinds of applications do you think could benefit from phrase mining? Why?
- Try to think broadly for more domains/languages.
- Based on your proposed applications, can we apply SegPhrase/AutoPhrase directly?
- Do you think any adaptation is necessary? If yes, how? If no, why?
Week 8: Possible Extension
- What are the drawbacks of these three papers? Do you see any limitations?
- Can we do better in order to address these limitations?
Week 9: Report Writing Discussion
- Do you have any questions about the final report writing?
- How do we prepare informative figures and tables?
- How do we properly cite previous work?
- How do we make the proposal look more promising?
Week 10: Elevator Pitch
We will have a timed rehearsal for the elevator pitch.