Part 4: AI, Cloud and Big Data Case Studies
The following summary of interchapter dependencies for Chapters 11–16 assumes that you’ve read Chapters 1–5. Most of Chapters 11–16 also require dictionary fundamentals from Section 6.2.
- Chapter 11, Natural Language Processing (NLP), uses pandas DataFrame features from Section 7.14’s Intro to Data Science.
- Chapter 12, Data Mining Twitter, uses pandas DataFrame features from Section 7.14’s Intro to Data Science, string method join (Section 8.9), JSON fundamentals (Section 9.5), TextBlob (Section 11.2) and word clouds (Section 11.3). Several examples require defining a class via inheritance (Chapter 10).
- Chapter 13, IBM Watson and Cognitive Computing, uses built-in function open and the with statement (Section 9.3).
- Chapter 14, Machine Learning: Classification, Regression and Clustering, uses NumPy array fundamentals and method unique (Chapter 7), pandas DataFrame features from Section 7.14’s Intro to Data Science and Matplotlib function subplots (Section 10.6).
- Chapter 15, Deep Learning, requires NumPy array fundamentals (Chapter 7), string method join (Section 8.9), general machine-learning concepts from Chapter 14 and features from Chapter 14’s Case Study: Classification with k-Nearest Neighbors and the Digits Dataset.
- Chapter 16, Big Data: Hadoop, Spark, NoSQL and IoT, uses string method split (Section 6.2.7), Matplotlib FuncAnimation from Section 6.4’s Intro to Data Science, pandas Series and DataFrame features from Section 7.14’s Intro to Data Science, string method join (Section 8.9), the json module (Section 9.5), NLTK stop words (Section 11.2.13) and, from Chapter 12, Twitter authentication, Tweepy’s StreamListener class for streaming tweets, and the geopy and folium libraries. A few examples require defining a class via inheritance (Chapter 10), but you can simply mimic the class definitions we provide without reading Chapter 10.
JUPYTER NOTEBOOKS
For your convenience, we provide the book’s code examples in Python source code (.py) files for use with the command-line IPython interpreter and as Jupyter Notebooks (.ipynb) files that you can load into your web browser and execute. Jupyter Notebooks is a free, open-source project that enables you to combine text, graphics, audio, video, and interactive coding functionality for entering, editing, executing, debugging, and modifying code quickly and conveniently in a web browser. According to the article, “What Is Jupyter?”:
Jupyter has become a standard for scientific research and data analysis. It packages computation and argument together, letting you build “computational narratives”; and it simplifies the problem of distributing working software to teammates and associates.
In our experience, it’s a wonderful learning environment and rapid-prototyping tool. For this reason, we use Jupyter Notebooks rather than a traditional IDE, such as Eclipse, Visual Studio, PyCharm or Spyder. Academics and professionals already use Jupyter extensively for sharing research results. Jupyter Notebooks support is provided through the traditional open-source community mechanisms (see “Getting Jupyter Help” later in this Preface). See the Before You Begin section that follows this Preface for software installation details, and see the test-drives in Section 1.5 for information on running the book’s examples.
https://jupyter.org/community.
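For example, a notebook interleaves explanatory text with executable cells whose output, including charts, renders inline in the browser. Here’s a minimal sketch of the kind of cell you might run, assuming Matplotlib is installed (this is our own illustration, not one of the book’s examples):

```python
# A hypothetical notebook cell (not from the book's examples):
# the chart renders inline, directly below the cell, in the browser.
%matplotlib inline
import matplotlib.pyplot as plt

values = [1, 4, 9, 16, 25]  # sample data to plot
plt.plot(values, marker='o')
plt.title('Squares')
plt.show()
```

Re-running such a cell immediately re-executes the code and refreshes the output, which is what makes notebooks so convenient for rapid prototyping.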
Collaboration and Sharing Results
Working in teams and communicating research results are both important for developers in, or moving into, data-analytics positions in industry, government or academia:
- The notebooks you create are easy to share among team members simply by copying the files or via GitHub.
- Research results, including code and insights, can be shared as static web pages via tools like nbviewer (https://nbviewer.jupyter.org) and GitHub—both automatically render notebooks as web pages.
Reproducibility: A Strong Case for Jupyter Notebooks
In data science, and in the sciences in general, experiments and studies should be reproducible. This has been written about in the literature for many years, including
- Donald Knuth’s 1992 computer science publication, Literate Programming.
Knuth, D., “Literate Programming,” The Computer Journal, British Computer Society, 1992.
- The article “Language-Agnostic Reproducible Data Analysis Using Literate Programming,” which says, “Lir (literate, reproducible computing) is based on the idea of literate programming as proposed by Donald Knuth.”
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164023.
Essentially, reproducibility captures the complete environment used to produce results: hardware, software, communications, algorithms (especially code), data and the data’s provenance (origin and lineage).
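As a small, concrete illustration (our own sketch, not from the book), you can record key parts of the software environment programmatically. The snippet below assumes Python 3.8+ (for importlib.metadata), and the package names listed are arbitrary examples:

```python
# Hypothetical sketch: log the software environment for reproducibility.
# Assumes Python 3.8+; the packages listed may or may not be installed.
import platform
import sys
from importlib.metadata import version, PackageNotFoundError

print('Python:', sys.version.split()[0])  # interpreter version
print('OS:', platform.platform())         # operating-system details

for package in ('numpy', 'pandas', 'matplotlib'):  # example packages
    try:
        print(f'{package}: {version(package)}')
    except PackageNotFoundError:
        print(f'{package}: not installed')
```

Including output like this alongside your results lets others check that they’re re-running your study with matching software versions.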
DOCKER
In Chapter 16, we’ll use Docker—a tool for packaging software into containers that bundle everything required to execute that software conveniently, reproducibly and portably across platforms. Some software packages we use in Chapter 16 require complicated setup and configuration. For many of these, you can download free preexisting Docker containers. These enable you to avoid complex installation issues and execute software locally on your desktop or notebook computers, making Docker a great way to help you get started with new technologies quickly and conveniently.
Docker also helps with reproducibility. You can create custom Docker containers configured with the versions of every piece of software and every library you used in your study. This enables other developers to recreate the environment you used and reproduce your work, and it helps you reproduce your own results. In Chapter 16, you’ll use Docker to download and execute a container that’s preconfigured for you to code and run big data Spark applications using Jupyter Notebooks.