Creating datasets, stemming, and stop word removal
Tokenize the data after removing stop words and applying stemming.
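A minimal sketch of this preprocessing step, under several assumptions: the stop-word list is a tiny illustrative one, `crude_stem` is a hypothetical suffix stripper standing in for a real stemmer (for example a Porter stemmer), and `preprocess` is a made-up helper name. The text is split into words first so that stop words can be matched; the list returned holds the final tokens.

```python
import re

# Tiny illustrative stop-word list (an assumption; use a fuller list in practice).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split into alphabetic words, drop stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]
```

For example, `preprocess("The cats are jumping on the beds")` yields `['cat', 'jump', 'bed']`.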
For each data set (not each file), count the number of times each token appears. Do not count all tokens. Create an ARFF (WEKA format) file for each data set. Each attribute will be a token, and its value will be that token's count.
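The counting and ARFF steps might look like the sketch below. It assumes each file has already been reduced to a list of tokens, and it reads "do not count all tokens" as a frequency cutoff, which is only one possible interpretation; `min_count`, `count_tokens`, and `to_arff` are hypothetical names. Each token becomes one numeric attribute, and the single data row holds the counts for the whole data set.

```python
from collections import Counter


def count_tokens(documents, min_count=1):
    """Sum token counts over every file in one data set.

    min_count is a hypothetical cutoff: tokens seen fewer times are dropped.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return {tok: n for tok, n in counts.items() if n >= min_count}


def to_arff(relation, counts):
    """Render counts as ARFF text: one numeric attribute per token, one data row."""
    tokens = sorted(counts)
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {tok} numeric" for tok in tokens]
    lines += ["", "@data", ",".join(str(counts[tok]) for tok in tokens)]
    return "\n".join(lines) + "\n"
```

The returned string can be written to a file with a `.arff` extension, one file per data set. Tokens here are assumed to be purely alphabetic; attribute names containing spaces or other special characters would need quoting.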