FastText: Library For Fast Text Representation And Classification (Facebook AI Research)
One of the biggest technical challenges facing artificial intelligence researchers is understanding the meaning of words as they roll off the tongue when one talks, or off one's fingertips as posts are typed out. But it is an essential need. Automatic text processing is a key part of your day-to-day interaction with your computer; it’s a critical component of everything from web search and content ranking to spam filtering, and when it works well, it is completely invisible to you. With the growing amount of data online, there is a need for more flexible tools that can better understand the content of very large datasets and provide more accurate classification results.
To address this need, the Facebook AI Research (FAIR) lab is open-sourcing fastText, a library designed to help build scalable solutions for text representation and classification. The lab believes that its ongoing commitment to sharing and collaborating with the community extends beyond just delivering code. It also believes that sharing what it has learned is important for advancing the field, so alongside the open-source release it has published its research relating to fastText.
What exactly is FastText?
As mentioned above, fastText is an open-source library developed by the Facebook AI Research (FAIR) lab. It is dedicated to representing and classifying text at scale, and according to FAIR it trains and classifies considerably faster than comparable tools. The library is written in C++ but also has interfaces for other languages, such as Python and Node.js.
Now, why fastText?
According to Facebook, “We can classify half a million sentences among 312K classes in less than a minute and train fastText on more than one billion words in less than 10 minutes using a standard multi-core CPU.” That kind of CPU-intensive classification can take hours with many other machine learning tools. Deep learning tools perform well on small datasets but tend to be very slow on large ones, which limits their use in production environments.
At its core, fastText uses a ‘bag of words’ approach, which disregards the order of words. In addition, it uses a hierarchical classifier instead of a linear one, which reduces the time complexity with respect to the number of classes from linear to logarithmic and makes it much more efficient on large datasets with a high category count.
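To make "disregarding the order of words" concrete, here is a minimal sketch of a bag-of-words representation. The helper name `bag_of_words` and the lowercase, whitespace-based tokenization are assumptions made for illustration; this is not fastText's own tokenizer or internal representation.

```python
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated tokens; word order is discarded."""
    return Counter(text.lower().split())


# Two sentences with the same words in a different order.
first = bag_of_words("the dog chased the cat")
second = bag_of_words("the cat chased the dog")

# Both sentences map to exactly the same bag of words.
same_representation = first == second
print(same_representation)
print(first["the"])
```

The two sentences mean different things, yet a pure bag-of-words model cannot tell them apart. That loss of information is the price paid for a representation that is very cheap to compute.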
Deep neural networks have also recently become very popular for text processing. While these models achieve very good performance in limited laboratory settings, they tend to be slow to train and test, which limits their use on very large datasets.
fastText helps solve this problem. To be efficient on datasets with a very large number of categories, it uses a hierarchical classifier instead of a flat structure, organizing the different categories in a tree. As a result, the time complexity of training and testing text classifiers drops from linear to logarithmic with respect to the number of classes. fastText also exploits the fact that classes are imbalanced, with some appearing far more often than others, by using the Huffman algorithm to build the tree that represents the categories, so that frequent categories sit closer to the root than rare ones.
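The effect of building the category tree with the Huffman algorithm can be sketched in a few lines. The class names and counts below are made up for illustration, and `huffman_depths` is a hypothetical helper rather than part of fastText; it only computes how deep each class ends up in a Huffman tree built from class frequencies.

```python
import heapq
import itertools


def huffman_depths(class_counts):
    """Return {label: depth} for a Huffman tree built from class frequencies."""
    if len(class_counts) == 1:
        return {label: 0 for label in class_counts}
    tie = itertools.count()  # tie-breaker so the heap never compares dicts
    # Each heap entry: (total count, tie-breaker, {label: depth in this subtree}).
    heap = [(count, next(tie), {label: 0}) for label, count in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every leaf moves one level down.
        count_a, _, depths_a = heapq.heappop(heap)
        count_b, _, depths_b = heapq.heappop(heap)
        merged = {label: d + 1 for label, d in {**depths_a, **depths_b}.items()}
        heapq.heappush(heap, (count_a + count_b, next(tie), merged))
    return heap[0][2]


# Hypothetical, imbalanced class frequencies.
counts = {"sports": 500, "politics": 250, "tech": 125, "science": 75, "poetry": 50}
depths = huffman_depths(counts)

# Average number of tree nodes visited per example, weighted by class frequency.
expected_depth = sum(counts[c] * depths[c] for c in counts) / sum(counts.values())

print(depths)
print(expected_depth)
```

With these toy counts the most frequent class sits one step from the root while the rarest classes sit four steps away, so an average example needs fewer than two binary decisions, whereas a flat classifier would score all five classes every time.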
fastText: A dedicated tool
Text classification is very important in the commercial world. Tools such as Vowpal Wabbit or libSVM build models for general classification problems, whereas fastText is dedicated exclusively to text classification. This allows it to be trained very quickly on extremely large datasets. Models have already been trained on more than 1 billion words in less than 10 minutes using a standard multicore CPU. fastText can also classify half a million sentences among more than 300,000 categories in less than five minutes.
The hope is that fastText will help the community build better, more scalable solutions for text representation and classification. Because it is delivered as an open-source library, FAIR believes fastText is a valuable addition to the research and engineering communities, one that will ultimately help everyone design better applications and make further advances in language understanding.
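As a rough sketch of what training a classifier looks like from Python, the snippet below prepares labeled lines in the `__label__` format that fastText reads by default and then trains with hierarchical softmax. The example data, the helper names (`format_example`, `write_training_file`, `train_and_predict`), and the file path are hypothetical; the third-party `fasttext` package is assumed to be installed and is imported only when training is actually requested.

```python
from pathlib import Path


def format_example(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    return f"__label__{label} {' '.join(text.split())}"


def write_training_file(examples, path):
    """Write (label, text) pairs to a UTF-8 file, one example per line."""
    lines = [format_example(label, text) for label, text in examples]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def train_and_predict(train_path, sentence):
    """Train a supervised model and return (top label, probability)."""
    import fasttext  # third-party package, assumed installed (pip install fasttext)

    # loss="hs" selects hierarchical softmax instead of a flat softmax.
    model = fasttext.train_supervised(input=str(train_path), loss="hs")
    labels, probabilities = model.predict(sentence)
    return labels[0], float(probabilities[0])


# Hypothetical toy data; a real training set would be far larger.
examples = [
    ("spam", "win a   free prize now"),
    ("ham", "see you at the meeting tomorrow"),
]
formatted = [format_example(label, text) for label, text in examples]
print(formatted[0])
```

Calling `write_training_file(examples, "train.txt")` followed by `train_and_predict("train.txt", "free prize")` would run the full loop, although a two-line dataset is far too small to produce a meaningful model.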
For More Information: GitHub