Pattern: A Web Mining Tool for Python

Sept. 15, 2017, 8:29 p.m. By: Prakarsh Saxena

Pattern

Data mining and data analysis have become key drivers of today's market. The insights drawn from the data all around us shape the market and prompt companies and industries to act, or to change their strategies, to make the most of the situation. But when this data has to be retrieved online, automated programs are needed to extract it from its sources. For Python, this task is eased by Pattern, a package designed specifically for web mining and analysis.

About Pattern

Pattern was primarily developed by Tom De Smedt and Walter Daelemans and is a collection of open source (BSD license) web mining modules for Python from the Computational Linguistics and Psycholinguistics Research Center. It contains tools for data retrieval, text analysis, and data visualization, and comes with over 30 sample scripts that give users an idea of how to use it.

It primarily has tools for:

  • Data Mining: web services (Google, Twitter, Wikipedia), web crawler, HTML DOM parser

  • Natural Language Processing: part-of-speech taggers, n-gram search, sentiment analysis, WordNet

  • Machine Learning: Vector Space Model, Clustering, Classification (KNN, SVM, Perceptron)

  • Network Analysis: Graph Centrality and Visualisation

Pattern supports Python 2.7 and Python 3.6+. The Python 3 version is currently only available on the development branch. To install Pattern so that it is available in all your scripts, unzip the download and run the following from the command line:

cd pattern-2.6
python setup.py install

If you have pip, you can automatically download and install from the PyPI repository:

pip install pattern
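
A quick way to check whether either installation route worked is to ask Python whether it can locate the package. The following is a minimal sketch that uses only the standard library and assumes Python 3; the helper name is_installed is made up for illustration and is not part of Pattern.

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package of this name."""
    # find_spec returns None when no importable module of this name is found.
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    if is_installed("pattern"):
        print("pattern is importable")
    else:
        print("pattern was not found on sys.path")
```

If the check reports that the package was not found, the manual options described next may help.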

If none of the above works, you can make Python aware of the module in three ways:

1. Put the pattern folder in the same folder as your script.

2. Put the pattern folder in the standard location for modules so it is available to all scripts:

  • c:\python26\Lib\site-packages\ (Windows),

  • /Library/Python/2.6/site-packages/ (Mac OS X),

  • /usr/lib/python2.6/site-packages/ (Unix).

3. Add the location of the module to sys.path in your script, before importing it:

import sys

MODULE = '/users/tom/desktop/pattern'
if MODULE not in sys.path:
    sys.path.append(MODULE)
from pattern.en import parsetree
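
Options 2 and 3 can also be sketched with the standard library alone. This is only an illustration, assuming Python 3: the function name add_module_path is hypothetical, and the path is the example path used above. The call sysconfig.get_paths()["purelib"] reports where the running interpreter keeps third-party modules, which may differ from the locations listed under option 2.

```python
import sys
import sysconfig


def add_module_path(path):
    """Append path to sys.path unless it is already there.

    Returns True if the path was added, False if it was already present.
    """
    if path in sys.path:
        return False
    sys.path.append(path)
    return True


if __name__ == "__main__":
    # Option 2: the site-packages directory of this interpreter.
    print(sysconfig.get_paths()["purelib"])
    # Option 3: example path from the text; replace it with your own location.
    add_module_path("/users/tom/desktop/pattern")
```

Wrapping the check in a function keeps repeated calls from adding the same entry to sys.path more than once.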

The whole package consists of six main modules:

  • pattern.web: A toolkit that includes APIs for various web services, including Google, Gmail, Bing, Twitter, Wikipedia and Flickr. It has its own HTML parser and web crawler.

  • pattern.table: A module for working with tabular data, used for storing data from the pattern.web module.

  • pattern.en: A natural language processing toolkit for English.

  • pattern.search: A module containing a search algorithm.

  • pattern.vector: A module containing various tools for analyzing the text of a document.

  • pattern.graph: A module for data visualization using Canvas.
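
To give a flavour of how these modules are used, the sketch below calls sentiment from pattern.en, which returns a (polarity, subjectivity) pair for an English string. The wrapper describe_sentiment and its fallback behaviour are assumptions made for illustration and are not part of Pattern. The import is guarded so that the script still runs when Pattern is not installed.

```python
try:
    from pattern.en import sentiment  # requires Pattern to be installed
except ImportError:
    sentiment = None  # fall back gracefully when Pattern is missing


def describe_sentiment(text, scorer=sentiment):
    """Label text as 'positive', 'negative' or 'neutral'.

    scorer is any callable returning a (polarity, subjectivity) pair.
    Returns None when no scorer is available.
    """
    if scorer is None:
        return None
    polarity, _subjectivity = scorer(text)
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    print(describe_sentiment("The movie was really wonderful."))
```

Passing the scorer as a parameter is a design choice for this sketch: it lets the labelling logic be tried out with a stand-in function before Pattern itself is installed.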

For more information about the package, you can visit the project's GitHub page.