In this assignment, you are given a dataset of approximately 20,000 news documents collected from a set of newsgroups (mailing lists). The documents (email messages) are partitioned almost evenly across 20 different topics such as sports, electronics, and politics. The documents of each newsgroup are stored in one directory, and each document is stored as a text file in a semi-structured format.
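The layout described above (one directory per newsgroup, one text file per document) could be read with something like the following sketch. The function name `load_newsgroups`, the root path, and the `latin-1` encoding are assumptions for illustration; they are not part of the assignment.

```python
from pathlib import Path


def load_newsgroups(root):
    """Return {newsgroup_name: [document_text, ...]}.

    Assumes `root` contains one subdirectory per newsgroup and that each
    file inside a subdirectory is a single document.
    """
    corpus = {}
    for group_dir in sorted(Path(root).iterdir()):
        if not group_dir.is_dir():
            continue  # skip stray files at the top level
        docs = []
        for doc_path in sorted(group_dir.iterdir()):
            if doc_path.is_file():
                # Assumption: latin-1 is used because it decodes any byte
                # sequence; the real encoding of the files is not stated.
                docs.append(doc_path.read_text(encoding="latin-1"))
        corpus[group_dir.name] = docs
    return corpus
```

Calling `load_newsgroups("path/to/dataset")` (a hypothetical path) would give a dictionary keyed by newsgroup directory name, so `len(corpus)` should be 20 if the dataset matches the description.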
Here is a sample document:
I attached the document below