

1. Introduction to Computers and Python
Objectives
In this chapter you’ll:
- Learn about exciting recent developments in computing.
- Review object-oriented programming basics.
- Understand the strengths of Python.
- Be introduced to key Python and data-science libraries you’ll use in this book.
- Test-drive the IPython interpreter’s interactive mode for executing Python code.
- Execute a Python script that animates a bar chart.
- Create and test-drive a web-browser-based Jupyter Notebook for executing Python code.
- Learn how big “big data” is and how quickly it’s getting even bigger.
- Read a big-data case study on a popular mobile navigation app.
- Be introduced to artificial intelligence—at the intersection of computer science and data science.
Outline
1.1 Introduction
1.2 A Quick Review of Object Technology Basics
1.3 Python
1.4 It’s the Libraries!
1.4.1 Python Standard Library
1.4.2 Data-Science Libraries
1.5 Test-Drives: Using IPython and Jupyter Notebooks
1.5.1 Using IPython Interactive Mode as a Calculator
1.5.2 Executing a Python Program Using the IPython Interpreter
1.5.3 Writing and Executing Code in a Jupyter Notebook
1.6 The Cloud and the Internet of Things
1.6.1 The Cloud
1.6.2 Internet of Things
1.7 How Big Is Big Data?
1.7.1 Big Data Analytics
1.7.2 Data Science and Big Data Are Making a Difference: Use Cases
1.8 Case Study—A Big-Data Mobile Application
1.9 Intro to Data Science: Artificial Intelligence—at the Intersection of CS and Data Science
1.10 Wrap-Up
1.1 INTRODUCTION
Welcome to Python—one of the world’s most widely used computer programming languages and, according to the Popularity of Programming Languages (PYPL) Index, the world’s most popular.
- https://pypl.github.io/PYPL.html (as of January 2020).
Here, we introduce terminology and concepts that lay the groundwork for the Python programming you’ll learn in Chapters 2–10 and the big-data, artificial-intelligence and cloud-based case studies we present in Chapters 11–16.
We’ll review object-oriented programming terminology and concepts. You’ll learn why Python has become so popular. We’ll introduce the Python Standard Library and various data-science libraries that help you avoid “reinventing the wheel.” You’ll use these libraries to create software objects that you’ll interact with to perform significant tasks with modest numbers of instructions.
Next, you’ll work through three test-drives showing how to execute Python code:
- In the first, you’ll use IPython to execute Python instructions interactively and immediately see their results.
- In the second, you’ll execute a substantial Python application that will display an animated bar chart summarizing rolls of a six-sided die as they occur. You’ll see the “Law of Large Numbers” in action. In Chapter 6, you’ll build this application with the Matplotlib visualization library.
- In the last, we’ll introduce Jupyter Notebooks using JupyterLab—an interactive, web-browser-based tool in which you can conveniently write and execute Python instructions. Jupyter Notebooks enable you to include text, images, audios, videos, animations and code.
In the past, most computer applications ran on standalone computers (that is, not networked together). Today’s applications can be written with the aim of communicating among the world’s billions of computers via the Internet. We’ll introduce the Cloud and the Internet of Things (IoT), laying the groundwork for the contemporary applications you’ll develop in Chapters 11–16.
You’ll learn just how big “big data” is and how quickly it’s getting even bigger. Next, we’ll present a big-data case study on the Waze mobile navigation app, which uses many current technologies to provide dynamic driving directions that get you to your destination as quickly and as safely as possible. As we walk through those technologies, we’ll mention where you’ll use many of them in this book. The chapter closes with our first Intro to Data Science section in which we discuss a key intersection between computer science and data science—artificial intelligence.
1.2 A QUICK REVIEW OF OBJECT TECHNOLOGY BASICS
As demands for new and more powerful software are soaring, building software quickly, correctly and economically is important. Objects, or more precisely, the classes objects come from, are essentially reusable software components. There are date objects, time objects, audio objects, video objects, automobile objects, people objects, etc. Almost any noun can be reasonably represented as a software object in terms of attributes (e.g., name, color and size) and behaviors (e.g., calculating, moving and communicating). Software-development groups can use a modular, object-oriented design-and-implementation approach to be much more productive than with earlier popular techniques like “structured programming.” Object-oriented programs are often easier to understand, correct and modify.
Automobile as an Object
To help you understand objects and their contents, let’s begin with a simple analogy. Suppose you want to drive a car and make it go faster by pressing its accelerator pedal. What must happen before you can do this? Well, before you can drive a car, someone has to design it. A car typically begins as engineering drawings, similar to the blueprints that describe the design of a house. These drawings include the design for an accelerator pedal.
The pedal hides from the driver the complex mechanisms that make the car go faster, just as the brake pedal “hides” the mechanisms that slow the car, and the steering wheel “hides” the mechanisms that turn the car. This enables people with little or no knowledge of how engines, braking and steering mechanisms work to drive a car easily. Just as you cannot cook meals in the blueprint of a kitchen, you cannot drive a car’s engineering drawings. Before you can drive a car, it must be built from the engineering drawings that describe it. A completed car has an actual accelerator pedal to make it go faster, but even that’s not enough—the car won’t accelerate on its own (hopefully!), so the driver must press the pedal to accelerate the car.
Methods and Classes
Let’s use our car example to introduce some key object-oriented programming concepts. Performing a task in a program requires a method. The method houses the program statements that perform its tasks. The method hides these statements from its user, just as the accelerator pedal of a car hides from the driver the mechanisms of making the car go faster. In Python, a program unit called a class houses the set of methods that perform the class’s tasks. For example, a class that represents a bank account might contain one method to deposit money to an account, another to withdraw money from an account and a third to inquire what the account’s balance is. A class is similar in concept to a car’s engineering drawings, which house the design of an accelerator pedal, steering wheel, and so on.
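To make the bank-account example concrete, here’s a minimal Python sketch of such a class. The class and method names are illustrative assumptions, not code from the book or any library:

class Account:
    """A minimal bank-account sketch (names are illustrative)."""
    def __init__(self, balance=0.0):
        self.balance = balance  # an instance variable (an attribute)

    def deposit(self, amount):
        """Add amount to the account's balance."""
        self.balance += amount

    def withdraw(self, amount):
        """Subtract amount from the account's balance."""
        self.balance -= amount

    def get_balance(self):
        """Return the current balance."""
        return self.balance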
Instantiation
Just as someone has to build a car from its engineering drawings before you can drive a car, you must build an object of a class before a program can perform the tasks that the class’s methods define. The process of doing this is called instantiation. An object is then referred to as an instance of its class.
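Using the hypothetical Account class sketched above, instantiation looks like this:

account = Account()   # build (instantiate) an Account object from the class
print(type(account))  # account is an instance of class Account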
Reuse
Just as a car’s engineering drawings can be reused many times to build many cars, you can reuse a class many times to build many objects. Reuse of existing classes when building new classes and programs saves time and effort. Reuse also helps you build more reliable and effective systems because existing classes and components often have undergone extensive testing, debugging and performance tuning. Just as the notion of interchangeable parts was crucial to the Industrial Revolution, reusable classes are crucial to the software revolution that has been spurred by object technology.
In Python, you’ll typically use a building-block approach to create your programs. To avoid reinventing the wheel, you’ll use existing high-quality pieces wherever possible. This software reuse is a key benefit of object-oriented programming.
Messages and Method Calls
When you drive a car, pressing its gas pedal sends a message to the car to perform a task—that is, to go faster. Similarly, you send messages to an object. Each message is implemented as a method call that tells a method of the object to perform its task. For example, a program might call a bank-account object’s deposit method to increase the account’s balance.
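Continuing the hypothetical Account sketch, sending a message is simply a method call:

account.deposit(50.0)         # send a deposit "message" by calling the method
print(account.get_balance())  # 50.0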
Attributes and Instance Variables
A car, besides having capabilities to accomplish tasks, also has attributes, such as its color, its number of doors, the amount of gas in its tank, its current speed and its record of total miles driven (i.e., its odometer reading). Like its capabilities, the car’s attributes are represented as part of its design in its engineering diagrams (which, for example, include an odometer and a fuel gauge). As you drive an actual car, these attributes are carried along with the car. Every car maintains its own attributes. For example, each car knows how much gas is in its own gas tank, but not how much is in the tanks of other cars.
An object, similarly, has attributes that it carries along as it’s used in a program. These attributes are specified as part of the object’s class. For example, a bank-account object has a balance attribute that represents the amount of money in the account. Each bank-account object knows the balance in the account it represents, but not the balances of the other accounts in the bank. Attributes are specified by the class’s instance variables. A class’s (and its object’s) attributes and methods are intimately related, so classes wrap together their attributes and methods.
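With the hypothetical Account sketch, each object carries its own balance instance variable:

checking = Account(100.0)
savings = Account(37.50)
print(checking.balance)  # 100.0; each object maintains its own balance
print(savings.balance)   # 37.5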
Inheritance
A new class of objects can be created conveniently by inheritance—the new class (called the subclass) starts with the characteristics of an existing class (called the superclass), possibly customizing them and adding unique characteristics of its own. In our car analogy, an object of class “convertible” certainly is an object of the more general class “automobile,” but more specifically, the roof can be raised or lowered.
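As a sketch of how this looks in Python, here’s a hypothetical SavingsAccount subclass of the Account class from earlier in this section:

class SavingsAccount(Account):  # subclass; Account is the superclass
    """Inherits balance, deposit and withdraw, then adds a capability."""
    def add_interest(self, rate):
        self.balance += self.balance * rate  # unique to SavingsAccount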
Object-Oriented Analysis and Design (OOAD)
Soon you’ll be writing programs in Python. How will you create the code for your programs? Perhaps, like many programmers, you’ll simply turn on your computer and start typing. This approach may work for small programs (like the ones we present in the early chapters of the book), but what if you were asked to create a software system to control thousands of automated teller machines for a major bank? Or suppose you were asked to work on a team of 1,000 software developers building the next generation of the U.S. air traffic control system? For projects so large and complex, you should not simply sit down and start writing programs.
To create the best solutions, you should follow a detailed analysis process for determining your project’s requirements (i.e., defining what the system is supposed to do), then develop a design that satisfies them (i.e., specifying how the system should do it). Ideally, you’d go through this process and carefully review the design (and have your design reviewed by other software professionals) before writing any code. If this process involves analyzing and designing your system from an object-oriented point of view, it’s called an object-oriented analysis-and-design (OOAD) process. Languages like Python are object-oriented. Programming in such a language, called object-oriented programming (OOP), allows you to implement an object-oriented design as a working system.
1.5 TEST-DRIVES: USING IPYTHON AND JUPYTER NOTEBOOKS
Before reading this section, follow the instructions in the Before You Begin section to install the Anaconda Python distribution, which contains the IPython interpreter.
In this section, you’ll test-drive the IPython interpreter in two modes:
- In interactive mode, you’ll enter small bits of Python code called snippets and immediately see their results.
- In script mode, you’ll execute code loaded from a file that has the .py extension (short for Python). Such files are called scripts or programs, and they’re generally longer than the code snippets you’ll use in interactive mode.
Then, you’ll learn how to use the browser-based environment known as the Jupyter Notebook for writing and executing Python code.
Jupyter supports many programming languages by installing their “kernels.” For more information see https://github.com/jupyter/jupyter/wiki/Jupyter-kernels.
1.5.1 Using IPython Interactive Mode as a Calculator
Let’s use IPython interactive mode to evaluate simple arithmetic expressions.
Entering IPython in Interactive Mode
First, open a command-line window on your system:
- On macOS, open a Terminal from the Applications folder’s Utilities subfolder.
- On Windows, open the Anaconda Command Prompt from the start menu.
- On Linux, open your system’s Terminal or shell (this varies by Linux distribution).
In the command-line window, type ipython, then press Enter (or Return). You’ll see text like the following; the exact text varies by platform and by IPython version:
Python 3.7.0 (default, Dec 10 2020, 13:11:52)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 An enhanced Interactive Python. Type '?'
for help.
In [1]:
The text "In [1]:" is a prompt, indicating that IPython is waiting for your input. You can type ? for help or begin entering snippets, as you’ll do momentarily.
Evaluating Expressions
In interactive mode, you can evaluate expressions:
In [1]: 45 + 72
Out[1]: 117
In [2]:
After you type 45 + 72 and press Enter, IPython reads the snippet, evaluates it and prints its result in Out[1]. Then IPython displays the In [2] prompt to show that it’s waiting for you to enter your second snippet. For each new snippet, IPython adds 1 to the number in the square brackets. Each In [1] prompt in the book indicates that we’ve started a new interactive session. We generally do that for each new section of a chapter.
Let’s evaluate a more complex expression:
In [2]: 5 * (12.7 - 4) / 2
Out[2]: 21.75
Python uses the asterisk (*) for multiplication and the forward slash (/) for division. As in mathematics, parentheses force the evaluation order, so the parenthesized expression (12.7 - 4) evaluates first, giving 8.7. Next, 5 * 8.7 evaluates, giving 43.5. Then, 43.5 / 2 evaluates, giving the result 21.75, which IPython displays in Out[2]. Whole numbers, like 5, 4 and 2, are called integers. Numbers with decimal points, like 12.7, 43.5 and 21.75, are called floating-point numbers.
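You can confirm this in interactive mode with Python’s built-in type function; integers have type int and floating-point numbers have type float:

In [3]: type(5)
Out[3]: int

In [4]: type(12.7)
Out[4]: float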
Exiting Interactive Mode
To leave interactive mode, you can:
- Type the exit command at the current In [] prompt and press Enter to exit immediately.
- Type the key sequence <Ctrl> + d (or <control> + d). This displays the prompt "Do you really want to exit ([y]/n)?". The square brackets around y indicate that it’s the default response—pressing Enter submits the default response and exits.
- Type <Ctrl> + d (or <control> + d) twice (macOS and Linux only).
1.5.2 Executing a Python Program Using the IPython Interpreter
In this section, you’ll execute a script named RollDieDynamic.py that you’ll write in Chapter 6. The .py extension indicates that the file contains Python source code. The script RollDieDynamic.py simulates rolling a six-sided die. It presents a colorful animated visualization that dynamically graphs the frequencies of each die face.
Changing to This Chapter’s Examples Folder
You’ll find the script in the book’s ch01 source-code folder. In the Before You Begin section you extracted the examples folder to your user account’s Documents folder. Each chapter has a folder containing that chapter’s source code. The folder is named ch##, where ## is a two-digit chapter number from 01 to 17. First, open your system’s command-line window. Next, use the cd (“change directory”) command to change to the ch01 folder:
- On macOS/Linux, type cd ~/Documents/examples/ch01, then press Enter.
- On Windows, type cd C:\Users\YourAccount\Documents\examples\ch01, then press Enter.
Executing the Script
To execute the script, type the following command at the command line, then press Enter:
ipython RollDieDynamic.py 6000 1
The script displays a window, showing the visualization. The numbers 6000 and 1 tell this script the number of times to roll dice and how many dice to roll each time. In this case, we’ll update the chart 6000 times for 1 die at a time.
For a six-sided die, the values 1 through 6 should each occur with “equal likelihood”—the probability of each is 1/6 or about 16.667%. If we roll a die 6000 times, we’d expect about 1000 of each face. Like coin tossing, die rolling is random, so there could be some faces with fewer than 1000, some with 1000 and some with more than 1000. We took the screen captures below during the script’s execution. This script uses randomly generated die values, so your results will differ. Experiment with the script by changing the value 1 to 100, 1000 and 10000. Notice that as the number of die rolls gets larger, the frequencies zero in on 16.667%. This is a phenomenon of the “Law of Large Numbers.”
[Screen captures of the animated bar chart, taken during the script’s execution, appear here.]
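If you’d like a preview before Chapter 6, here’s a minimal, non-animated sketch of ours (not the book’s RollDieDynamic.py) that tallies 6000 rolls and prints each face’s frequency. Rerun it with larger numbers of rolls to watch the percentages converge on 16.667%:

import random
from collections import Counter

rolls = [random.randrange(1, 7) for _ in range(6000)]  # 6000 six-sided die rolls
for face, frequency in sorted(Counter(rolls).items()):
    print(f'{face}: {frequency:5} ({frequency / len(rolls):.3%})')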
Creating Scripts
Problems That May Occur at Execution Time
1.5.3 Writing and Executing Code in a Jupyter Notebook
The Anaconda Python Distribution that you installed in the Before You Begin section comes with the Jupyter Notebook—an interactive, browser-based environment in which you can write and execute code and intermix the code with text, images and video. Jupyter Notebooks are broadly used in the data-science community in particular and the broader scientific community in general. They’re the preferred means of doing Python-based data analytics studies and reproducibly communicating their results. The Jupyter Notebook environment supports a growing number of programming languages.
For your convenience, all of the book’s source code also is provided in Jupyter Notebooks that you can simply load and execute. In this section, you’ll use the JupyterLab interface, which enables you to manage your notebook files and other files that your notebooks use (like images and videos). As you’ll see, JupyterLab also makes it convenient to write code, execute it, see the results, modify the code and execute it again. You’ll see that coding in a Jupyter Notebook is similar to working with IPython—in fact, Jupyter Notebooks use IPython by default. In this section, you’ll create a notebook, add the code from Section 1.5.1 to it and execute that code.
Opening JupyterLab in Your Browser
To open JupyterLab, change to the ch01 examples folder in your Terminal, shell or Anaconda Command Prompt (as in Section 1.5.2), type the following command, then press Enter (or Return):
jupyter lab
This executes the Jupyter Notebook server on your computer and opens JupyterLab in your default web browser, showing the ch01 folder’s contents in the File Browser tab at the left side of the JupyterLab interface:
The Jupyter Notebook server enables you to load and run Jupyter Notebooks in your web browser. From the JupyterLab Files tab, you can double-click files to open them in the right side of the window where the Launcher tab is currently displayed. Each file you open appears as a separate tab in this part of the window. If you accidentally close your browser, you can reopen JupyterLab by entering the following address (the local Jupyter server’s default) in your web browser:
http://localhost:8888/lab
Creating a New Jupyter Notebook
In the Launcher tab under Notebook, click the Python 3 button to create a new Jupyter Notebook named Untitled.ipynb in which you can enter and execute Python 3 code. The file extension .ipynb is short for IPython Notebook—the original name of the Jupyter Notebook.
Renaming the Notebook
Rename Untitled.ipynb as TestDrive.ipynb:
1. Right-click the Untitled.ipynb tab and select Rename Notebook.
2. Change the name to TestDrive.ipynb and click RENAME.
The top of JupyterLab should now appear as follows:
Evaluating an Expression
The unit of work in a notebook is a cell in which you can enter code snippets. By default, a new notebook contains one cell—the rectangle in the TestDrive.ipynb notebook—but you can add more. To the cell’s left, the notation [ ]: is where the Jupyter Notebook will display the cell’s snippet number after you execute the cell. Click in the cell, then type the expression
45 + 72
To execute the current cell’s code, type Ctrl + Enter (or control + Enter). JupyterLab executes the code in IPython, then displays the results below the cell:
Adding and Executing Another Cell
Let’s evaluate a more complex expression. First, click the + button in the toolbar above the notebook’s first cell—this adds a new cell below the current one:
Click in the new cell, then type the expression
5 * (12.7 - 4) / 2
and execute the cell by typing Ctrl + Enter (or control + Enter):
Saving the Notebook
If your notebook has unsaved changes, the X in the notebook’s tab will change to a dot (●). To save the notebook, select the File menu in JupyterLab (not at the top of your browser’s window), then select Save Notebook.
Notebooks Provided with Each Chapter’s Examples
For your convenience, each chapter’s examples also are provided as ready-to-execute notebooks without their outputs. This enables you to work through them snippet-by-snippet and see the outputs appear as you execute each snippet.
So that we can show you how to load an existing notebook and execute its cells, let’s reset the TestDrive.ipynb notebook to remove its output and snippet numbers. This will return it to a state like the notebooks we provide for the subsequent chapters’ examples. From the Kernel menu select Restart Kernel and Clear All Outputs..., then click the RESTART button. The preceding command also is helpful whenever you wish to re-execute a notebook’s snippets. The notebook should now appear as follows:
From the File menu, select Save Notebook, then click the TestDrive.ipynb tab’s X button to close the notebook.
Opening and Executing an Existing Notebook
When you launch JupyterLab from a given chapter’s examples folder, you’ll be able to open notebooks from that folder or any of its subfolders. Once you locate a specific notebook, double-click it to open it. Open the TestDrive.ipynb notebook again now. Once a notebook is open, you can execute each cell individually, as you did earlier in this section, or you can execute the entire notebook at once. To do so, from the Run menu select Run All Cells. The notebook will execute the cells in order, displaying each cell’s output below that cell.
Closing JupyterLab
When you’re done with JupyterLab, you can close its browser tab, then in the Terminal, shell or Anaconda Command Prompt from which you launched JupyterLab, type Ctrl + c (or control + c) twice.
JupyterLab Tips
While working in JupyterLab, you might find these tips helpful:
- If you need to enter and execute many snippets, you can execute the current cell and add a new one below it by typing Shift + Enter, rather than Ctrl + Enter (or control + Enter).
- As you get into the later chapters, some of the snippets you’ll enter in Jupyter Notebooks will contain many lines of code. To display line numbers within each cell, select Show line numbers from JupyterLab’s View menu.
More Information on Working with JupyterLab
JupyterLab has many more features that you’ll find helpful. We recommend that you read the Jupyter team’s introduction to JupyterLab at:
https://jupyterlab.readthedocs.io/en/stable/
For a quick overview, click Overview under GETTING STARTED. Also, under USER GUIDE read the introductions to The JupyterLab Interface, Working with Files, Text Editor and Notebooks for many additional features.
1.6 THE CLOUD AND THE INTERNET OF THINGS
1.6.1 The Cloud
More and more computing today is done “in the cloud”—that is, distributed across the Internet worldwide. Many apps you use daily are dependent on cloud-based services that use massive clusters of computing resources (computers, processors, memory, disk drives, etc.) and databases that communicate over the Internet with each other and the apps you use. A service that provides access to itself over the Internet is known as a web service. As you’ll see, using cloud-based services in Python often is as simple as creating a software object and interacting with it. That object then uses web services that connect to the cloud on your behalf.
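As a taste of that style, here’s a sketch using the popular requests library (included with Anaconda) to call a hypothetical JSON web service; a real service publishes its own documented URL and usually requires credentials:

import requests  # HTTP library included with the Anaconda distribution

# hypothetical endpoint; a real web service documents its own URL and keys
response = requests.get('https://api.example.com/conditions?city=Boston')
data = response.json()  # parse the service's JSON reply into a Python dict
print(data)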
Throughout the Chapters 11–16 examples, you’ll work with many cloud-based services:
- In Chapters 12 and 16, you’ll use Twitter’s web services (via the Python library Tweepy) to get information about specific Twitter users, search for tweets from the last seven days and receive streams of tweets as they occur—that is, in real time.
- In Chapters 11 and 12, you’ll use the Python library TextBlob to translate text between languages. Behind the scenes, TextBlob uses the Google Translate web service to perform those translations.
- In Chapter 13, you’ll use IBM Watson’s Text to Speech, Speech to Text and Translate services. You’ll implement a traveler’s assistant translation app that enables you to speak a question in English, transcribes the speech to text, translates the text to Spanish and speaks the Spanish text. The app then allows you to speak a Spanish response (in case you don’t speak Spanish, we provide an audio file you can use), transcribes the speech to text, translates the text to English and speaks the English response. Via IBM Watson demos, you’ll also experiment with many other Watson cloud-based services in Chapter 13.
- In Chapter 16, you’ll work with Microsoft Azure’s HDInsight service and other Azure web services as you implement big-data applications using Apache Hadoop and Spark. Azure is Microsoft’s set of cloud-based services.
- In Chapter 16, you’ll use the Dweet.io web service to simulate an Internet-connected thermostat that publishes temperature readings online. You’ll also use a web-based service to create a “dashboard” that visualizes the temperature readings over time and warns you if the temperature gets too low or too high.
- In Chapter 16, you’ll use a web-based dashboard to visualize a simulated stream of live sensor data from the PubNub web service. You’ll also create a Python app that visualizes a PubNub simulated stream of live stock-price changes.
Mashups
The applications-development methodology of mashups enables you to rapidly develop powerful software applications by combining (often free) complementary web services and other forms of information feeds—as you’ll do in our IBM Watson traveler’s assistant translation app. One of the first mashups combined the real-estate listings provided by http://www.craigslist.org with the mapping capabilities of Google Maps to offer maps that showed the locations of homes for sale or rent in a given area.
ProgrammableWeb (http://www.programmableweb.com/) provides a directory of over 20,750 web services and almost 8,000 mashups. They also provide how-to guides and sample code for working with web services and creating your own mashups. According to their website, some of the most widely used web services are Facebook, Google Maps, Twitter and YouTube.
1.6.2 Internet of Things
The Internet is no longer just a network of computers—it’s an Internet of Things (IoT). A thing is any object with an IP address and the ability to send, and in some cases receive, data automatically over the Internet. Such things include:
- a car with a transponder for paying tolls,
- monitors for parking-space availability in a garage,
- a heart monitor implanted in a human,
- water quality monitors,
- a smart meter that reports energy usage,
- radiation detectors,
- item trackers in a warehouse,
- mobile apps that can track your movement and location,
- smart thermostats that adjust room temperatures based on weather forecasts and activity in the home, and
- intelligent home appliances.
According to statista.com, there are already over 23 billion IoT devices in use today, and there could be over 75 billion IoT devices in 2025.
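To get a feel for the data such things produce, here’s a minimal simulated thermostat. The field names are our own assumptions, and a real device would publish each reading to a web service (as you’ll simulate with Dweet.io in Chapter 16) rather than print it:

import json
import random
import time

for _ in range(5):  # five simulated readings, one per second
    reading = {'sensor': 'thermostat', 'celsius': round(random.uniform(15, 25), 1)}
    print(json.dumps(reading))  # serialize the reading as JSON
    time.sleep(1)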
1.7 HOW BIG IS BIG DATA?
For computer scientists and data scientists, data is now as important as writing programs.
According to IBM, approximately 2.5 quintillion bytes (2.5 exabytes) of data are created daily, and 90% of the world’s data was created in the last two years. According to IDC, the global data supply will reach 175 zettabytes (equal to 175 trillion gigabytes or 175 billion terabytes) annually by 2025. Consider the following examples of various popular data measures.
- https://www.ibm.com/blogs/watson/2016/06/welcome-to-the-world-of-a-i/.
- https://www.networkworld.com/article/3325397/storage/idc-expect-175zettabytes-of-data-worldwide-by-2025.html.
Megabytes (MB)
One megabyte is about one million (actually 2²⁰) bytes. Many of the files we use on a daily basis require one or more MBs of storage. Some examples include:
- MP3 audio files—High-quality MP3s range from 1 to 2.4 MB per minute.
https://www.audiomountain.com/tech/audio-file-size.html.
- Photos—JPEG format photos taken on a digital camera can require about 8 to 10 MB per photo.
- Video—Smartphone cameras can record video at various resolutions. Each minute of video can require many megabytes of storage. For example, on one of our iPhones, the Camera settings app reports that 1080p video at 30 frames-per-second (FPS) requires 130 MB/minute and 4K video at 30 FPS requires 350 MB/minute.
Gigabytes (GB)
One gigabyte is about 1000 megabytes (actually 2³⁰ bytes). A dual-layer DVD can store up to 8.5 GB, which translates to:
https://en.wikipedia.org/wiki/DVD.
- as much as 141 hours of MP3 audio,
- approximately 1000 photos from a 16-megapixel camera,
- approximately 7.7 minutes of 1080p video at 30 FPS, or
- approximately 2.85 minutes of 4K video at 30 FPS.
The current highest-capacity Ultra HD Blu-ray discs can store up to 100 GB of video. Streaming a 4K movie can use between 7 and 10 GB per hour (highly compressed).
https://en.wikipedia.org/wiki/Ultra_HD_Bluray.
Terabytes (TB)
One terabyte is about 1000 gigabytes (actually 2⁴⁰ bytes). Recent disk drives for desktop computers come in sizes up to 15 TB, which is equivalent to:
https://www.zdnet.com/article/worldsbiggest-hard-drive-meet-western-digitals15tb-monster/.
- approximately 28 years of MP3 audio,
- approximately 1.68 million photos from a 16-megapixel camera,
- approximately 226 hours of 1080p video at 30 FPS and
- approximately 84 hours of 4K video at 30 FPS.
Nimbus Data now has the largest solid-state drive (SSD) at 100 TB, which can store 6.67 times the 15-TB examples of audio, photos and video listed above.
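You can reproduce the audio and photo equivalences above with quick decimal-unit arithmetic. This is a back-of-the-envelope sketch; the video figures depend on the assumed recording rates:

mb_per_tb = 1_000_000               # decimal units: 1 TB is about 1,000,000 MB
drive_mb = 15 * mb_per_tb           # a 15-TB drive expressed in MB
mp3_minutes = drive_mb / 1          # high-quality MP3 at about 1 MB per minute
print(mp3_minutes / 60 / 24 / 365)  # about 28.5 years of audio
photos = drive_mb / 9               # 16-megapixel JPEGs at about 9 MB each
print(f'{photos / 1e6:.2f} million photos')  # about 1.67 million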
Petabytes, Exabytes and Zettabytes
There are nearly four billion people online creating about 2.5 quintillion bytes of data each day—that’s 2500 petabytes (each petabyte is about 1000 terabytes) or 2.5 exabytes (each exabyte is about 1000 petabytes). According to a March 2016 AnalyticsWeek article, within five years there will be over 50 billion devices connected to the Internet (most of them through the Internet of Things, which we discuss in Sections 1.6.2 and 16.8) and by 2020 we’ll be producing 1.7 megabytes of new data every second for every person on the planet. At today’s numbers (approximately 7.7 billion people), that’s about
- 13 petabytes of new data per second,
- 780 petabytes per minute,
- 46,800 petabytes (46.8 exabytes) per hour and
- 1,123 exabytes per day—that’s 1.123 zettabytes (ZB) per day (each zettabyte is about 1000 exabytes).
That’s the equivalent of over 5.5 million hours (over 600 years) of 4K video every day or approximately 116 billion photos every day!
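Those figures follow directly from the per-person rate; here’s a quick check (the chapter’s numbers scale up the rounded 13 PB-per-second value):

people = 7.7e9                  # approximate world population
mb_per_person_per_second = 1.7
pb_per_second = people * mb_per_person_per_second * 1e6 / 1e15
print(f'{pb_per_second:.1f} PB/second')         # about 13.1
print(f'{13 * 60} PB/minute')                   # 780
print(f'{13 * 60 * 60} PB/hour')                # 46,800
print(f'{13 * 60 * 60 * 24 / 1e6:.3f} ZB/day')  # about 1.123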
Additional Big-Data Stats
For an entertaining real-time sense of big data, check out https://www.internetlivestats.com, with various statistics, including the numbers so far today of
- Google searches.
- Tweets.
- Videos viewed on YouTube.
- Photos uploaded on Instagram.
You can click each statistic to drill down for more information. For instance, they say over 250 billion tweets were sent in 2018.
Some other interesting big-data facts:
- Every hour, YouTube users upload 24,000 hours of video, and almost 1 billion hours of video are watched on YouTube every day.
https://www.brandwatch.com/blog/youtubestats/.
- Every second, there are 51,773 GBs (or 51.773 TBs) of Internet traffic, 7894 tweets sent, 64,332 Google searches and 72,029 YouTube videos viewed.
http://www.internetlivestats.com/onesecond.
- On Facebook each day there are 800 million “likes,” 60 million emojis are sent, and there are over two billion searches of the more than 2.5 trillion Facebook posts since the site’s inception.
https://mashable.com/2017/07/17/facebookworldemojiday/.
https://techcrunch.com/2016/07/27/facebookwillmakeyoutalk/.
- In June 2017, Will Marshall, CEO of Planet, said the company has 142 satellites that image the whole planet’s land mass once per day. They add one million images and seven TBs of new data each day. Together with their partners, they’re using machine learning on that data to improve crop yields, see how many ships are in a given port and track deforestation. With respect to Amazon deforestation, he said: “Used to be we’d wake up after a few years and there’s a big hole in the Amazon. Now we can literally count every tree on the planet every day.”
https://www.bloomberg.com/news/videos/20170630/learning-from-planets-shoe-boxedsize-dsatellitesvideo, June 30, 2017.
Domo, Inc. has a nice infographic called “Data Never Sleeps 6.0” showing how much data is generated every minute, including:
https://www.domo.com/learn/dataneversleeps6.
- 473,400 tweets sent.
- 2,083,333 Snapchat photos shared.
- 97,222 hours of Netflix video viewed.
- 12,986,111 text messages sent.
- 49,380 Instagram posts.
- 176,220 Skype calls.
- 750,000 Spotify songs streamed.
- 3,877,140 Google searches.
- 4,333,560 YouTube videos watched.
Computing Power Over the Years
Data is getting more massive and so is the computing power for processing it. The performance of today’s processors is often measured in terms of FLOPS (floating-point operations per second). In the early to mid-1990s, the fastest supercomputer speeds were measured in gigaflops (10⁹ FLOPS). By the late 1990s, Intel produced the first teraflop (10¹² FLOPS) supercomputers. In the early-to-mid 2000s, speeds reached hundreds of teraflops, then in 2008, IBM released the first petaflop (10¹⁵ FLOPS) supercomputer. Currently, the fastest supercomputer—the IBM Summit, located at the Department of Energy’s (DOE) Oak Ridge National Laboratory (ORNL)—is capable of 122.3 petaflops.
Distributed computing can link thousands of personal computers via the Internet to produce even more FLOPS. In late 2016, the Folding@home network—a distributed network in which people volunteer their personal computers’ resources for use in disease research and drug design—was capable of over 100 petaflops. Companies like IBM are now working toward supercomputers capable of exaflops (10¹⁸ FLOPS).
- https://en.wikipedia.org/wiki/Folding@home.
- https://en.wikipedia.org/wiki/FLOPS.
- https://www.ibm.com/blogs/research/2017/06/supercomputingweather-modelexascale/.
Quantum computers now under development theoretically could operate at 18,000,000,000,000,000,000 times the speed of today’s “conventional computers”! This number is so extraordinary that in one second, a quantum computer theoretically could do staggeringly more calculations than the total that have been done by all computers since the world’s first computer appeared. This almost unimaginable computing power could wreak havoc with blockchain-based cryptocurrencies like Bitcoin. Engineers are already rethinking blockchain to prepare for such massive increases in computing power.
- https://medium.com/@n.biedrzycki/onlygod-can-count-that-fast-the-world-ofquantum-computing-406a0a91fcf4.
- https://singularityhub.com/2017/11/05/isquantum-computing-an-existential-threatto-blockchain-technology/.
The history of supercomputing power is that it eventually works its way down from research labs, where extraordinary amounts of money have been spent to achieve those performance numbers, into “reasonably priced” commercial computer systems and even desktop computers, laptops, tablets and smartphones.
Computing power’s cost continues to decline, especially with cloud computing. People used to ask the question, “How much computing power do I need on my system to deal with my peak processing needs?” Today, that thinking has shifted to “Can I quickly carve out on the cloud what I need temporarily for my most demanding computing chores?” You pay for only what you use to accomplish a given task.
Processing the World’s Data Requires Lots of Electricity
Data from the world’s Internet-connected devices is exploding, and processing that data requires tremendous amounts of energy. According to a recent article, energy use for processing data in 2015 was growing at 20% per year and consuming approximately three to five percent of the world’s power. The article says that total data-processing power consumption could reach 20% of the world’s power by 2025.
Another enormous electricity consumer is the blockchain-based cryptocurrency Bitcoin. Processing just one Bitcoin transaction uses approximately the same amount of energy as powering the average American home for a week! The energy use comes from the process Bitcoin “miners” use to prove that transaction data is valid.
According to some estimates, a year of Bitcoin transactions consumes more energy than many countries. Together, Bitcoin and Ethereum (another popular blockchain-based platform and cryptocurrency) consume more energy per year than Israel and almost as much as Greece.
Morgan Stanley predicted in 2018 that “the electricity consumption required to create cryptocurrencies this year could actually outpace the firm’s projected global electric vehicle demand—in 2025.” This situation is unsustainable, especially given the huge interest in blockchainbased applications, even beyond the cryptocurrency explosion. The blockchain community is working on fixes.
Big-Data Opportunities
The big-data explosion is likely to continue exponentially for years to come. With 50 billion computing devices on the horizon, we can only imagine how many more there will be over the next few decades. It’s crucial for businesses, governments, the military and even individuals to get a handle on all this data.
It’s interesting that some of the best writings about big data, data science, artificial intelligence and more are coming out of distinguished business organizations, such as J.P. Morgan and McKinsey. Big data’s appeal to big business is undeniable given the rapidly accelerating accomplishments. Many companies are making significant investments and getting valuable results through technologies in this book, such as big data, machine learning, deep learning and natural-language processing. This is forcing competitors to invest as well, rapidly increasing the need for computing professionals with data-science and computer-science experience. This growth is likely to continue for many years.
1.7.1 Big Data Analytics
Data analytics is a mature and well-developed academic and professional discipline. The term “data analysis” was coined in 1962, though people have been analyzing data using statistics for thousands of years going back to the ancient Egyptians. Big data analytics is a more recent phenomenon—the term “big data” was coined around 2000.
Consider four of the V’s of big data:
There are lots of articles and papers that add many other V-words to this list.
- Volume—the amount of data the world is producing is growing exponentially.
- Velocity—the speed at which that data is being produced, the speed at which it moves through organizations and the speed at which data changes are growing quickly.
- Variety—data used to be alphanumeric (that is, consisting of alphabetic characters, digits, punctuation and some special characters)—today it also includes images, audios, videos and data from an exploding number of Internet of Things sensors in our homes, businesses, vehicles, cities and more.
- Veracity—the validity of the data—is it complete and accurate? Can we trust that data when making crucial decisions? Is it real?
Most data is now being created digitally in a variety of types, in extraordinary volumes and moving at astonishing velocities. Moore’s Law and related observations have enabled us to store data economically and to process and move it faster—and all at rates growing exponentially over time. Digital data storage has become so vast in capacity, cheap and small that we can now conveniently and economically retain all the digital data we’re creating. That’s big data.
The following Richard W. Hamming quote—although from 1962—sets the tone for the rest of this book:
“The purpose of computing is insight, not numbers.”
- https://www.forbes.com/sites/gilpress/2013/05/28/avery-short-history-of-datascience/.
Data science is producing new, deeper, subtler and more valuable insights at a remarkable pace. It’s truly making a difference. Big data analytics is an integral part of the answer. We address big-data infrastructure in Chapter 16 with hands-on case studies on NoSQL databases, Hadoop MapReduce programming, Spark, real-time Internet of Things (IoT) stream programming and more.
Turck, M., and J. Hao, Great Power, Great Responsibility: The 2018 Big Data & AI Landscape, http://mattturck.com/big-data-2018/
1.7.2 Data Science and Big Data Are Making a Difference: Use Cases
Lewis, M., Moneyball: The Art of Winning an Unfair Game (W. W. Norton & Company, 2004).
Data-science use cases
- anomaly detection
- assisting people with disabilities
- auto-insurance risk prediction
- automated closed captioning
- automated image captions
- automated investing
- autonomous ships
- brain mapping
- caller identification
- cancer diagnosis/treatment
- carbon emissions reduction
- classifying handwriting
- computer vision
- credit scoring
- crime: predicting locations
- crime: predicting recidivism
- crime: predictive policing
- crime: prevention
- CRISPR gene editing
- crop-yield improvement
- customer churn
- customer experience
- customer retention
- customer satisfaction
- customer service
- customer service agents
- customized diets
- cybersecurity
- data mining
- data visualization
- detecting new viruses
- diagnosing breast cancer
- diagnosing heart disease
- diagnostic medicine
- disaster-victim identification
- drones
- dynamic driving routes
- dynamic pricing
- electronic health records
- emotion detection
- energy-consumption reduction
- facial recognition
- fitness tracking
- fraud detection
- game playing
- genomics and healthcare
- Geographic Information Systems (GIS)
- GPS systems
- health outcome improvement
- hospital readmission reduction
- human genome sequencing
- identity-theft prevention
- immunotherapy
- insurance pricing
- intelligent assistants
- Internet of Things (IoT) and medical device monitoring
- Internet of Things and weather forecasting
- inventory control
- language translation
- location-based services
- loyalty programs
- malware detection
- mapping
- marketing
- marketing analytics
- music generation
- natural-language translation
- new pharmaceuticals
- opioid abuse prevention
- personal assistants
- personalized medicine
- personalized shopping
- phishing elimination
- pollution reduction
- precision medicine
- predicting cancer survival
- predicting disease outbreaks
- predicting health outcomes
- predicting student enrollments
- predicting weather-sensitive product sales
- predictive analytics
- preventative medicine
- preventing disease outbreaks
- reading sign language
- real-estate valuation
- recommendation systems
- reducing overbooking
- ride sharing
- risk minimization
- robo financial advisors
- security enhancements
- self-driving cars
- sentiment analysis
- sharing economy
- similarity detection
- smart cities
- smart homes
- smart meters
- smart thermostats
- smart traffic control
- social analytics
- social graph analysis
- spam detection
- spatial data analysis
- sports recruiting and coaching
- stock market forecasting
- student performance assessment
- summarizing text
- telemedicine
- terrorist attack prevention
- theft prevention
- travel recommendations
- trend spotting
- visual product search
- voice recognition
- voice search
- weather forecasting
1.8 CASE STUDY—A BIG-DATA MOBILE APPLICATION
Google’s Waze GPS navigation app, with its 90 million monthly active users, is one of the most successful big-data apps. Early GPS navigation devices and apps relied on static maps and GPS coordinates to determine the best route to your destination. They could not adjust dynamically to changing traffic situations.
Waze processes massive amounts of crowdsourced data—that is, the data that’s continuously supplied by their users and their users’ devices worldwide. They analyze this data as it arrives to determine the best route to get you to your destination in the least amount of time. To accomplish this, Waze relies on your smartphone’s Internet connection. The app automatically sends location updates to their servers (assuming you allow it to). They use that data to dynamically reroute you based on current traffic conditions and to tune their maps. Users report other information, such as roadblocks, construction, obstacles, vehicles in breakdown lanes, police locations, gas prices and more. Waze then alerts other drivers in those locations. Waze uses many technologies to provide its services. We’re not privy to how Waze is implemented, but we infer below a list of technologies they probably use. You’ll use many of these in Chapters 11–16. For example,
- Most apps created today use at least some open-source software. You’ll take advantage of many open-source libraries and tools throughout this book.
- Waze communicates information over the Internet between their servers and their users’ mobile devices. Today, such data often is transmitted in JSON (JavaScript Object Notation) format, which we’ll introduce in Chapter 9 and use in subsequent chapters. The JSON data is typically hidden from you by the libraries you use (see the short JSON sketch after this list).
- Waze uses speech synthesis to speak driving directions and alerts to you, and speech recognition to understand your spoken commands. We use IBM Watson’s speech-synthesis and speech-recognition capabilities in Chapter 13.
- Once Waze converts a spoken natural-language command to text, it must determine the correct action to perform, which requires natural language processing (NLP). We present NLP in Chapter 11 and use it in several subsequent chapters.
- Waze displays dynamically updated visualizations such as alerts and maps. Waze also enables you to interact with the maps by moving them or zooming in and out. We create dynamic visualizations with Matplotlib and Seaborn throughout the book, and we display interactive maps with Folium in Chapters 12 and 16.
- Waze uses your phone as a streaming Internet of Things (IoT) device. Each phone is a GPS sensor that continuously streams data over the Internet to Waze. In Chapter 16, we introduce IoT and work with simulated IoT streaming sensors.
- Waze receives IoT streams from millions of phones at once. It must process, store and analyze that data immediately to update your device’s maps, to display and speak relevant alerts and possibly to update your driving directions. This requires massively parallel processing capabilities implemented with clusters of computers in the cloud. In Chapter 16, we’ll introduce various big-data infrastructure technologies for receiving streaming data, storing that big data in appropriate databases and processing the data with software and hardware that provide massively parallel processing capabilities.
- Waze uses artificial-intelligence capabilities to perform the data-analysis tasks that enable it to predict the best routes based on the information it receives. In Chapters 14 and 15 we use machine learning and deep learning, respectively, to analyze massive amounts of data and make predictions based on that data.
- Waze probably stores its routing information in a graph database. Such databases can efficiently calculate shortest routes. We introduce graph databases, such as Neo4J, in Chapter 16.
- Many cars are now equipped with devices that enable them to “see” cars and obstacles around them. These are used, for example, to help implement automated braking systems and are a key part of self-driving car technology. Rather than relying on users to report obstacles and stopped cars on the side of the road, navigation apps could take advantage of cameras and other sensors by using deep-learning computer-vision techniques to analyze images “on the fly” and automatically report those items. We introduce deep learning for computer vision in Chapter 15.
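To illustrate the JSON bullet above, here’s a tiny sketch using Python’s standard-library json module; the field names are hypothetical, not Waze’s actual format:

import json

# a hypothetical location update like those a navigation app might transmit
update = {'user': 'driver1', 'lat': 40.7128, 'lng': -74.0060, 'mph': 27}
text = json.dumps(update)       # serialize the dict to a JSON string
print(text)                     # {"user": "driver1", "lat": 40.7128, ...}
print(json.loads(text)['lat'])  # parse it back and access a field: 40.7128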
1.9 INTRO TO DATA SCIENCE: ARTIFICIAL INTELLIGENCE—AT THE INTERSECTION OF CS AND DATA SCIENCE
Artificial-Intelligence Milestones
- In a 1997 match between IBM’s Deep Blue computer system and chess Grandmaster Garry Kasparov, Deep Blue became the first computer to beat a reigning world chess champion under tournament conditions. IBM loaded Deep Blue with hundreds of thousands of grandmaster chess games. Deep Blue was capable of using brute force to evaluate up to 200 million moves per second! This is big data at work. IBM received the Carnegie Mellon University Fredkin Prize, which in 1980 offered $100,000 to the creators of the first computer to beat a world chess champion.
- https://en.wikipedia.org/wiki/Deep_Blue_versus_Garry_Kasparov.
- https://en.wikipedia.org/wiki/Deep_Blue_(chess_computer).
- In 2011, IBM’s Watson beat the two best human Jeopardy! players in a $1 million match. Watson simultaneously used hundreds of language-analysis techniques to locate correct answers in 200 million pages of content (including all of Wikipedia) requiring four terabytes of storage. Watson was trained with machine-learning and reinforcement-learning techniques. Chapter 13 discusses IBM Watson and Chapter 14 discusses machine learning.
- Go—a board game created in China thousands of years ago—is widely considered to be one of the most complex games ever invented, with 10¹⁷⁰ possible board configurations. To give you a sense of how large a number that is, it’s believed that there are (only) between 10⁷⁸ and 10⁸⁷ atoms in the known universe! In 2015, AlphaGo—created by Google’s DeepMind group—used deep learning with two neural networks to beat the European Go champion Fan Hui. Go is considered to be a far more complex game than chess. Chapter 15 discusses neural networks and deep learning.
- More recently, Google generalized its AlphaGo AI to create AlphaZero—a game-playing AI that teaches itself to play other games. In December 2017, AlphaZero learned the rules of and taught itself to play chess in less than four hours using reinforcement learning. It then beat the world champion chess program, Stockfish 8, in a 100-game match—winning or drawing every game. After training itself in Go for just eight hours, AlphaZero was able to play Go vs. its AlphaGo predecessor, winning 60 of 100 games.
A Personal Anecdote
Watson and Big Data Open New Possibilities
When Paul and I started working on this Python book, we were immediately drawn to IBM’s Watson using big data and artificial-intelligence techniques like natural language processing (NLP) and machine learning to beat two of the world’s best human Jeopardy! players. We realized that Watson could probably handle problems like the sequence predictor because it was loaded with the world’s street maps and a whole lot more. That whetted our appetite for digging in deep on big data and today’s artificial-intelligence technologies, and helped shape Chapters 11–16 of this book. It’s notable that all of the data-science implementation case studies in Chapters 11–16 either are rooted in artificial-intelligence technologies or discuss the big-data hardware and software infrastructure that enables computer scientists and data scientists to implement leading-edge AI-based solutions effectively.
AI: A Field with Problems But No Solutions
For many decades, AI has been viewed as a field with problems but no solutions. That’s because once a particular problem is solved, people say, “Well, that’s not intelligence, it’s just a computer program that tells the computer exactly what to do.” However, with machine learning (Chapter 14) and deep learning (Chapter 15) we’re not preprogramming solutions to specific problems. Instead, we’re letting our computers solve problems by learning from data—and, typically, lots of it. Many of the most interesting and challenging problems are being pursued with deep learning. Google alone has thousands of deep-learning projects underway and that number is growing quickly. As you work through this book, we’ll introduce you to many edge-of-the-practice artificial-intelligence, big-data and cloud technologies.
1.10 WRAP-UP
In this chapter, we introduced terminology and concepts that lay the groundwork for the Python programming you’ll learn in
Chapters 2–10 and the big-data, artificial-intelligence and cloud-based case studies we present in Chapters 11–16.
We reviewed object-oriented programming concepts and discussed why Python has become so popular. We introduced the Python Standard Library and various data-science libraries that help you avoid “reinventing the wheel.” In subsequent chapters, you’ll use these libraries to create software objects that you’ll interact with to perform significant tasks with modest numbers of instructions. You worked through three test-drives showing how to execute Python code with the IPython interpreter and Jupyter Notebooks. We introduced the Cloud and the Internet of Things (IoT), laying the groundwork for the contemporary applications you’ll develop in Chapters 11–16.
We discussed just how big “big data” is and how quickly it’s getting even bigger, and presented a big-data case study on the Waze mobile navigation app, which uses many current technologies to provide dynamic driving directions that get you to your destination as quickly and as safely as possible. We mentioned where in this book you’ll use many of those technologies. The chapter closed with our first Intro to Data Science section in which we discussed a key intersection between computer science and data science—artificial intelligence.

