Technical tags: Course Project, Data Mining, Machine Learning, Java, Matlab

Project 1: forest cover type prediction

This is an individual project.

Report and Source Code

Introduction

This project predicts forest cover type. The training and testing sets are provided as CSV files. Based on these samples, a classifier is built to recognize different forest cover types, i.e., to determine which kind of tree covers a given area. The wilderness regions are divided into 30x30-meter cells, and in the dataset each cell is covered by one type of tree. An artificial neural network (ANN) model is used for the prediction.

Method

In this project, I use two different models based on ANNs.
  1. The first model is a single ANN. The input layer has 54 nodes to accept the 54 features of each record, and the hidden layer has 48 nodes. The best hidden-layer size is hard to determine analytically, so we chose it by running experiments.
  2. The second model applies a group of ANNs to the classification task. The intuition is to collect the outputs of several independently trained ANNs and decide the class by majority vote.
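The majority vote in the second model can be sketched in a few lines (shown here in Python for brevity; the original project was implemented in Java/Matlab):

```python
from collections import Counter

def majority_vote(predictions):
    """Pick the class predicted by the most ANNs in the group.
    Ties go to whichever class appears first among the inputs."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. five ANNs voting on the cover type of one area
print(majority_vote([2, 1, 2, 3, 2]))  # → 2
```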
In addition, 10 new features are derived from the physical meaning of the original ones. For example, from the horizontal distance to hydrology and the vertical distance to hydrology, we can compute the Euclidean distance to hydrology.
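The hydrology example amounts to a one-line derived feature (a Python sketch; the function name is mine, not from the original code):

```python
import math

def euclidean_distance_to_hydrology(horizontal, vertical):
    """Straight-line distance to the nearest water body, derived from the
    raw horizontal and vertical hydrology-distance features."""
    return math.hypot(horizontal, vertical)

print(euclidean_distance_to_hydrology(30.0, 40.0))  # → 50.0
```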

Result

The best result from a single ANN comes from the model with 64 input features, including the 10 newly introduced ones. Its precision on the testing set is 74.73%.
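The hidden-layer size and feature set were chosen experimentally. A minimal sketch of such a search, using scikit-learn's MLPClassifier on synthetic stand-in data (the real project ran its own ANN code on the cover-type CSVs):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# synthetic stand-in: 54 features, 7 cover types, as in the real dataset
X, y = make_classification(n_samples=400, n_features=54, n_informative=20,
                           n_classes=7, random_state=0)

best_size, best_score = None, -1.0
for hidden in (16, 32, 48, 64):           # candidate hidden-layer sizes
    clf = MLPClassifier(hidden_layer_sizes=(hidden,),
                        max_iter=200, random_state=0)
    score = cross_val_score(clf, X, y, cv=3).mean()
    if score > best_score:
        best_size, best_score = hidden, score
```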



By using a group of 14 ANNs, we can slightly improve the performance; the final precision on the testing set is 74.819%. In the following figure, the x-axis is the number of ANNs and the y-axis is the precision on the training set.

Project 2: text mining

This is an individual project.

Report and Source Code

Introduction

When we search for a hashtag on Twitter, it simply returns results that contain the hashtag. Because one hashtag can carry several meanings, these results may refer to different subtopics related to the hashtag, or even to other topics entirely. Twitter gives no further analysis of how many subtopics a search result contains. For a user, surfacing subtopics in the results is a valuable search feature. In this project, I collected tweets from hashtag search results and clustered each search result into appropriate subtopics, which helps users discover interesting and valuable topics among massive numbers of tweets.

Method

The workflow of the project runs from data collection to clustering.
  1. Get the raw data from the search results by using the Twitter API.
  2. Tokenize each tweet, remove stop words, and apply some other preprocessing.
  3. Build a sparse term-document matrix.
  4. Feed the term-document matrix to an SVD solver, which outputs a transformed document matrix with reduced dimensionality.
  5. Cluster the transformed documents, using the Silhouette value to evaluate the clustering result.
  6. Use the number of clusters with the best Silhouette value to report the subtopics.
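The steps above can be sketched end-to-end with scikit-learn on a toy corpus (a Python illustration of the pipeline; the original project used Java/Matlab, and the real input is the raw tweets fetched from the Twitter API):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# toy stand-in for a hashtag search result
tweets = [
    "apple releases new iphone software update",
    "big iphone software update from apple today",
    "apple stock price rises again after earnings",
    "stock price of apple hits a new record",
    "iphone case cover deals on ebay this week",
    "cheap iphone cover and case deals online",
]

# steps 2-3: tokenize, drop stop words, build the term-document matrix
X = TfidfVectorizer(stop_words="english").fit_transform(tweets)
# step 4: reduce dimensionality with truncated SVD
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
# step 5: cluster the transformed documents and score the clustering
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Z)
score = silhouette_score(Z, labels)
```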

Result

We use the #apple dataset as an example. Because the dataset contains a huge number of tweets, we cannot afford to compute the Silhouette value over every instance directly. Instead, a sampling method is applied: after clustering into k clusters, we randomly choose 2000 samples from the dataset and compute their mean Silhouette value, repeat this sampling 100 times, and average the means to obtain the Silhouette estimate for k clusters. For the #apple dataset, the result is shown in the figure here: within the range from 1 to 20, the best number of clusters is k=7.
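The sampling scheme can be sketched as follows (Python with scikit-learn on synthetic blob data; the original computed this in Matlab over the tweet vectors):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def sampled_silhouette(X, labels, sample_size=2000, n_rounds=100, seed=0):
    """Estimate the mean Silhouette value by averaging over repeated
    random samples, instead of scoring every instance in a huge dataset."""
    rng = np.random.default_rng(seed)
    means = []
    for _ in range(n_rounds):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        sub = labels[idx]
        if len(np.unique(sub)) < 2:      # silhouette needs >= 2 clusters
            continue
        means.append(silhouette_samples(X[idx], sub).mean())
    return float(np.mean(means))

# synthetic stand-in for the #apple tweet vectors
X, y = make_blobs(n_samples=3000, centers=7, random_state=0)
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)
score = sampled_silhouette(X, labels, sample_size=500, n_rounds=20)
```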

We show the clusters on a two-dimensional plot in the figure below. Some clusters appear mixed together because the clustering is performed in a six-dimensional space while only the two major dimensions are shown here; those clusters are separated along the other dimensions.

From each cluster, I select two tweets as representatives of the group. It is not hard to see that each cluster has a main subtopic.
Group 1 (prices related news):
37415: hertz: we actually made $  million less than we thought (htz) #news #phone #apple #mobile
17174: oil prices are falling #news #phone #apple #mobile
Group 2 (something from different countries):
7462: spain collections  . intensive english s  - sit tv production unit #itunesu #itunes #iphone #apple #mac
5528: china courses  . stage   japanese   - kolbe catholic college #itunesu #itunes #iphone #apple #mac
Group 3 (entertainment news):
28937: united states game paid  . dude perfect - dude perfect error #itunes #iphone #apps #apple
67647: the game (season  ). hd premiere now on itunes. [$ . ] #apple
Group 4 (ios features):
28660: #ios   #jailbreak now compatible with macs #apple #ios  #tech #news
64760: top #ios news: no #iphone plus recall  office for iphone  #apple watch to be big bucks
Group 5 (cover protector of iPhone):
67013: #apple #iphone  - hybrid id credit card holder case cover with stand for apple iphone   plus bl... #deals ebay ca
56013: #apple #iphone  - green  . mm super thin matte transparent protective case cover for iphone :  ... #deals ebay uk
Group 6 (something about “smile”):
10899: new #apple #macbook - item - smile if you love an arts administration. sleeves for macbooks
44368: new #apple #macbook - item - smile if you love a gerontological nurse practitio macbook ...
Group 7 (user experience about apple products):
29098: fancy something a bit different? why not try the @drygate #apple #ale? brewed in glesga toon! #glasgow #craftbeer
35679: this #apple is #amazeballs good. cc @umnews @sewardcoop
    
The results verify that the proposed method clusters similar tweets into the same group.