Recently I have been reading about an NLP (Natural Language Processing) library for NodeJS. The library is called natural, and it is a general natural language facility for NodeJS. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, string similarity, and some inflections are currently supported.
Several posts describe this library, among them Chris Umbel’s blog and webdesignerdepot. In this post, I will summarize what I learned from them.
The installation is very simple: you can use npm, the NodeJS package manager, as follows:
npm install natural
If you prefer to use the GitHub version, you can clone the repository and install it from there:
git clone git://github.com/NaturalNode/natural.git
cd natural
npm install .
Let’s see some simple functions… First of all, you have to include the library:
var nlp = require('natural');
Let’s start with the tokenizer. What is a tokenizer? A sentence is made up of words (also known as tokens), and a tokenizer splits a string into those words. The simplest tokenizer is the WordTokenizer:
var tokenizer = new nlp.WordTokenizer();
console.log(tokenizer.tokenize("This sentence is very short. It is ok."));
It splits on anything except alphabetic characters, digits, and underscores. The result is the following:
[ 'This', 'sentence', 'is', 'very', 'short', 'It', 'is', 'ok' ]
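The splitting rule described above can be approximated in plain JavaScript with a single regular expression. This is only a sketch of the idea, not the library's own code; the function name simpleWordTokenize is made up for this example, and it assumes plain ASCII text.

```javascript
// Hypothetical helper: split on every run of characters that is NOT
// a letter, a digit, or an underscore, then drop empty strings.
function simpleWordTokenize(text) {
  return text.split(/[^A-Za-z0-9_]+/).filter((token) => token.length > 0);
}

console.log(simpleWordTokenize('This sentence is very short. It is ok.'));
```

The filter step is needed because a sentence that ends with a dot leaves an empty string at the end of the split result.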
Another is the WordPunctTokenizer, which splits on anything except alphabetic characters, digits, punctuation, and underscores. With it, the previous example becomes:
var wordPunctTokenizer = new nlp.WordPunctTokenizer();
console.log(wordPunctTokenizer.tokenize("This sentence is very short. It is ok."));
and the result is:
[ 'This', 'sentence', 'is', 'very', 'short', '.', 'It', 'is', 'ok', '.' ]
As you can see, the second tokenizer also keeps the dot ‘.’ as a separate token in the array.
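A rough way to picture this behaviour in plain JavaScript is a regular expression that matches either a run of word characters or a single punctuation mark. The function name simpleWordPunctTokenize is hypothetical, and the pattern is a simplification that assumes ASCII text; the library's own pattern may differ in its details.

```javascript
// Hypothetical helper: keep runs of letters, digits and underscores as
// tokens, and keep every other non-space character as its own token.
function simpleWordPunctTokenize(text) {
  return text.match(/[A-Za-z0-9_]+|[^A-Za-z0-9_\s]/g) || [];
}

console.log(simpleWordPunctTokenize('This sentence is very short. It is ok.'));
```

Matching tokens instead of splitting on separators is what lets the punctuation survive in the output.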
There are other tokenizers, some of them for a specific language. For example, for Italian, it is possible to use the AggressiveTokenizerIt.
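The library also provides string distance functions. For example, we can compute the Levenshtein distance between “Davide” and “Divide” by calling nlp.LevenshteinDistance("Davide", "Divide"). The result is 1, because a single substitution (“a” becomes “i”) turns one word into the other.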
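To see what that number measures, here is a minimal sketch of the classic edit-distance computation in plain JavaScript. It is not the library's implementation; the function name levenshtein is made up for this example, and it assumes that every insertion, deletion, and substitution costs 1.

```javascript
// Hypothetical helper: Levenshtein distance with unit costs,
// computed row by row with dynamic programming.
function levenshtein(source, target) {
  // previous[j] = distance between the processed prefix of source
  // and the first j characters of target (starts with the empty prefix).
  let previous = Array.from({ length: target.length + 1 }, (_, j) => j);

  for (let i = 1; i <= source.length; i++) {
    const current = [i]; // turning i characters into '' takes i deletions
    for (let j = 1; j <= target.length; j++) {
      const sameChar = source[i - 1] === target[j - 1];
      const substitution = previous[j - 1] + (sameChar ? 0 : 1);
      const deletion = previous[j] + 1;
      const insertion = current[j - 1] + 1;
      current[j] = Math.min(substitution, deletion, insertion);
    }
    previous = current;
  }
  return previous[target.length];
}

console.log(levenshtein('Davide', 'Divide')); // 1
```

Only two rows of the table are kept in memory at a time, which is enough because each cell depends only on the current and the previous row.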
Another interesting feature is classification. Currently, the library has two classifiers: Naive Bayes and logistic regression. We can start with the BayesClassifier:
var classifier = new nlp.BayesClassifier();
Training the model is very simple. You add the annotated documents and then call the train method. Here I add documents for two categories, television and radio:
classifier.addDocument('I like television', 'television');
classifier.addDocument('I hate tv-series', 'television');
classifier.addDocument('Listen to the radio', 'radio');
classifier.addDocument('Change the radio program', 'radio');
classifier.train();
To predict the category of a new sentence, pass it to the classifier's classify method. A sentence such as 'See television' is classified with the label television.
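For readers curious about what a Naive Bayes classifier does with these four documents, here is a minimal sketch in plain JavaScript. It is not the library's implementation: the names TinyNaiveBayes and tokenize are made up for this example, add-one smoothing is my own assumption, and the preprocessing a real library would apply (such as stemming) is left out.

```javascript
// Hypothetical helper: lowercase the text and split it into words.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9_]+/).filter(Boolean);
}

// Hypothetical multinomial Naive Bayes with add-one smoothing.
class TinyNaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.vocabulary = new Set(); // every word seen during training
    this.totalDocs = 0;
  }

  addDocument(text, label) {
    this.docCounts.set(label, (this.docCounts.get(label) || 0) + 1);
    if (!this.wordCounts.has(label)) this.wordCounts.set(label, new Map());
    const counts = this.wordCounts.get(label);
    for (const word of tokenize(text)) {
      counts.set(word, (counts.get(word) || 0) + 1);
      this.vocabulary.add(word);
    }
    this.totalDocs += 1;
  }

  classify(text) {
    let bestLabel = null;
    let bestScore = -Infinity;
    for (const [label, docCount] of this.docCounts) {
      const counts = this.wordCounts.get(label);
      let totalWords = 0;
      for (const count of counts.values()) totalWords += count;
      // Sum log probabilities: the prior, then one term per word.
      let score = Math.log(docCount / this.totalDocs);
      for (const word of tokenize(text)) {
        const count = counts.get(word) || 0;
        score += Math.log((count + 1) / (totalWords + this.vocabulary.size));
      }
      if (score > bestScore) {
        bestScore = score;
        bestLabel = label;
      }
    }
    return bestLabel;
  }
}

const sketch = new TinyNaiveBayes();
sketch.addDocument('I like television', 'television');
sketch.addDocument('I hate tv-series', 'television');
sketch.addDocument('Listen to the radio', 'radio');
sketch.addDocument('Change the radio program', 'radio');
console.log(sketch.classify('See television')); // television
```

No separate training step is needed in this sketch because the probabilities are computed from the raw counts at classification time.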
To use the LogisticRegressionClassifier, simply replace BayesClassifier with LogisticRegressionClassifier:
var classifier2 = new nlp.LogisticRegressionClassifier();
classifier2.addDocument('I like television', 'television');
classifier2.addDocument('I hate tv-series', 'television');
classifier2.addDocument('Listen to the radio', 'radio');
classifier2.addDocument('Change the radio program', 'radio');
classifier2.train();
console.log(classifier2.classify('See television'));
Moreover, the library allows you to use the Maximum Entropy Classifier.
The library has many other features, such as stemmers, sentiment analysis, and phonetics. Just one hint: try it.