Facing The Future With Time-Tested Tools: Data Science At The Command Line

Oct. 25, 2017, 8:16 p.m. By: Kirti Bakshi

Data Science

No matter how handy graphical user interfaces become, the good old command line remains a useful tool for low-level data manipulation and system administration tasks. It is a real fallback when something you need to do has no graphical control. Being far more open-ended and expressive than a predefined set of controls, the command shell is arguably the ultimate control environment for any computer.

Data Science at the Command Line is a book by data scientist Jeroen Janssens that covers the tools available at the Linux command line for data analysis tasks. The book is organized by theme into chapters on obtaining, scrubbing, modeling, and interpreting data, interleaved with intermezzo chapters on parameterizing shell scripts, parallelization with GNU Parallel, and the Drake workflow tool.

This hands-on guide demonstrates how the flexibility of the command line can help you become more efficient and productive as a data scientist. You’ll learn how to combine small, powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, whether you’re on Windows or Linux, Janssens introduces the Data Science Toolbox, an easy-to-install virtual environment packed with more than 80 command-line tools.

The book is therefore a great read that shows why the command line is a scalable and extensible technology. Even if you’re already comfortable processing data with R or Python, it can greatly improve your data science workflow by leveraging the power of the command line.

About The Book:

Data science has become one of the most intensely practiced computer applications, and it is no wonder that it benefits greatly from the hands-on control of the command-line shell.

The original motivation for the book was the desire to move away from purely GUI-based approaches to data analysis. This is a very common desire among data analysts: GUIs are very good for a quick look, but once you want to repeat an analysis, or even a visualization, they become troublesome.

The book's style aims to provide a stream of practical examples of various command-line tools and to illustrate their applications when they are strung together. With real-life examples, the author shows how to use classic Linux command-line tools such as cut, grep, uniq, tr, and sort to your advantage, and how to get data from the Internet, databases, or even Microsoft Excel spreadsheets, where much of the world's operational data lies hidden from plain sight.
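To give a flavor of what stringing these classic tools together looks like, here is a minimal sketch. It is not taken from the book: the function name top_cities and the sample name,city records are hypothetical and exist only for illustration. It counts how often each city appears in comma-separated input, ignoring letter case.

```shell
#!/bin/sh
# Hypothetical helper (not from the book): read "name,city" lines on stdin
# and print each city with its number of occurrences, most frequent first.
top_cities() {
  cut -d, -f2 |                    # keep only the second comma-separated field
    tr '[:upper:]' '[:lower:]' |   # normalize case so "Rotterdam" == "ROTTERDAM"
    grep -v '^$' |                 # drop empty lines (records with no city)
    sort |                         # group identical values; uniq needs sorted input
    uniq -c |                      # collapse duplicates, prefixing each with a count
    sort -rn                       # order by count, highest first
}

# Made-up sample records, fed through the pipeline.
printf 'ann,Rotterdam\nbob,Tilburg\ncat,rotterdam\ndan,Maastricht\neve,ROTTERDAM\n' |
  top_cities
```

Each stage does one small job and passes plain text to the next, which is why the same pipeline can be rerun unchanged on a new file, something that is hard to do with a sequence of mouse clicks.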

You will also learn to explore data using visualizations such as bar charts and box plots. The command line is therefore not just plain text: although the images are generated using commands, they are shown in a GUI window.

The book starts by pitching the command line as a substitute for GUI-driven applications and finishes by proposing it as a replacement for a conventional programming language, a claim that some readers may find hard to agree with.

Data Science at the Command Line is definitely worth a read, though not a book to be followed religiously: it is a showcase of what is possible rather than a reference that tells you exactly how to do it.

About The Author:

Jeroen Janssens is a data scientist, teacher, and entrepreneur based in Rotterdam, Netherlands. He is the founder and CEO of Data Science Workshops, which provides on-the-job training and coaching in machine learning, programming, and data visualization. He holds an MSc in AI from Maastricht University and a PhD in machine learning from Tilburg University.

PDF Link: Data Science At The Command Line