
The Data Science Pipeline: Visualization

Reading time: ~30 min

Data visualization is a way to leverage your visual cortex to gain insight into data. Because vision is such a rich and well-developed interface between the human mind and the external world, visualization is a critical tool for understanding and communicating data ideas.

The standard graphics library in Python is Matplotlib, but here we will use a newer package called Plotly. Plotly offers several material advantages relative to Matplotlib: (1) figures support interactions like mouseovers and animations, (2) there is support for genuine 3D graphics, and (3) Plotly is not Python-specific: it can be used directly in JavaScript, R, or Julia.

If you use Plotly in a Jupyter notebook, the figures will automatically display in an interactive form, so we recommend following along in a separate tab with a Jupyter notebook. On this page, however, we will use the function show, imported in the cell below, to display the figures as static images.

from datagymnasia import show
print("Success!")

Scatter plot

We can visualize the relationship between two columns of numerical data by associating them with the horizontal and vertical axes of the Cartesian plane and drawing a point in the figure for each observation. This is called a scatter plot. In Plotly Express, scatter plots are created using the px.scatter function. The columns to associate with the two axes are identified by name using the keyword arguments x and y.

import plotly.express as px
import pydataset
iris = pydataset.data('iris')
show(px.scatter(iris,x='Sepal.Width',y='Sepal.Length'))

An aesthetic is any visual property of a plot object. For example, horizontal position is an aesthetic, since we can visually distinguish objects based on their horizontal position in a graph. We call horizontal position the x aesthetic. Similarly, the y aesthetic represents vertical position.

We say that the x='Sepal.Width' argument maps the 'Sepal.Width' variable to the x aesthetic. We can map other variables to other aesthetics, with further keyword arguments, like color and symbol:

show(px.scatter(iris,
                x='Sepal.Width',
                y='Sepal.Length',
                color='Species',
                symbol='Species'))

Note that we mapped the same categorical variable ('Species') to both the color and symbol aesthetics.
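
To make the idea of mapping a variable to an aesthetic more concrete, here is a minimal sketch in plain Python. It is not how Plotly Express is implemented; the function name map_to_aesthetic and the three-color palette are hypothetical, chosen only for illustration. Each distinct category is assigned one visual value, and every observation then inherits the value of its category.

```python
def map_to_aesthetic(values, palette):
    """Assign each distinct value a palette entry, in order of first appearance."""
    assignment = {}
    for value in values:
        if value not in assignment:
            # Reuse palette entries cyclically if there are more categories than colors.
            assignment[value] = palette[len(assignment) % len(palette)]
    return assignment

# Hypothetical observations and palette (not Plotly's default colors).
species = ["setosa", "setosa", "versicolor", "virginica", "versicolor"]
palette = ["blue", "red", "green"]

color_of = map_to_aesthetic(species, palette)
point_colors = [color_of[s] for s in species]

print(color_of)
print(point_colors)
```

Passing the same column to two keyword arguments, as in the plot above, amounts to building two such assignments (one for colors, one for symbols) from the same variable.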

Exercise
Create a new data frame by appending a new column called "area", computed as the product of petal length and petal width. Map this new column to the size aesthetic (keeping x, y, and color the same as above). Which species of flower has the smallest petal area?

 

Solution. We use the assign method to add the suggested column, and we include an additional keyword argument to map the new column to the size aesthetic. In the resulting plot, the setosa flowers have the smallest petal areas.

show(px.scatter(iris.assign(area = iris["Petal.Length"] * 
                                   iris['Petal.Width']),
                x='Sepal.Width',
                y='Sepal.Length',
                color='Species',
                size='area'))

Faceting

Rather than distinguishing species by color, we could also show them on three separate plots. This is called faceting. In Plotly Express, variables can be faceted using the facet_row and facet_col arguments.

show(px.scatter(iris, 
                x = 'Sepal.Width', 
                y = 'Sepal.Length', 
                facet_col = 'Species'))

Line plots

A point is not the only geometric object we can use to represent data. A line might be more suitable if we want to help guide the eye from one data point to the next. Points and lines are examples of plot geometries. Geometries are tied to Plotly Express functions: px.scatter uses the point geometry, and px.line uses the line geometry.

Let's make a line plot using the Gapminder data set, which records life expectancy and per-capita GDP for 142 countries.

import plotly.express as px
gapminder = px.data.gapminder()
usa = gapminder.query('country == "United States"')
show(px.line(usa, x="year", y="lifeExp"))

The line_group argument allows us to group the data by country so we can plot multiple lines. Let's also map the 'continent' variable to the color aesthetic.

show(px.line(gapminder, 
             x="year", 
             y="lifeExp", 
             line_group="country", 
             color="continent"))

Exercise
Although Plotly Express is designed primarily for data analysis, it can be used for mathematical graphs as well. Use px.line to graph the function $x \mapsto e^x$ over the interval $[0,5]$.

Hint: begin by making a new data frame with appropriate columns. You might find np.linspace useful.

 

Solution. We use np.linspace to define an array of x-values, and we exponentiate it to make an array of y-values. We package these together into a data frame and plot it with px.line as usual:

import numpy as np
import pandas as pd
x = np.linspace(0,5,100)
y = np.exp(x)
df = pd.DataFrame({'x': x, 'exp(x)': y})
show(px.line(df, x = 'x', y = 'exp(x)'))

Bar plots

Another common plot geometry is the bar. Suppose we want to know the average petal width for flowers with a given petal length. We can group by petal length and aggregate with the mean function to obtain the desired data, and then visualize it with a bar graph:

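# Select the numeric column before averaging: taking the mean of the
# non-numeric 'Species' column raises an error in recent pandas versions.
show(px.bar(iris.groupby('Petal.Length')['Petal.Width'].mean().reset_index(), 
            x = 'Petal.Length', 
            y = 'Petal.Width'))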

We use reset_index because we want to be able to access the index column of the data frame (which contains the petal lengths), and the index is not directly accessible from Plotly Express. Resetting makes the index a normal column and replaces it with consecutive integers starting from 0.
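
The effect of reset_index is easier to see on a small table. The sketch below uses a hypothetical five-row data frame standing in for the iris columns; only the pandas calls themselves (groupby, mean, reset_index) are real API.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris columns used above.
toy = pd.DataFrame({
    "Petal.Length": [1.4, 1.4, 4.5, 4.5, 5.1],
    "Petal.Width":  [0.2, 0.4, 1.5, 1.3, 1.9],
})

# After grouping, the petal lengths live in the index, not in a column.
grouped = toy.groupby("Petal.Length")[["Petal.Width"]].mean()
print(list(grouped.columns))

# reset_index moves the index into an ordinary column and
# replaces it with consecutive integers starting from 0.
flat = grouped.reset_index()
print(list(flat.columns))
print(list(flat.index))
```

Before the reset, 'Petal.Length' cannot be passed by name as a column; afterwards it can, which is what the x = 'Petal.Length' argument relies on.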

Perhaps the most common use of the bar geometry is to make histograms. A histogram is a bar plot obtained by binning observations into intervals based on the values of a particular variable and plotting the intervals on the horizontal axis and the bin counts on the vertical axis.

Here's an example of a histogram in Plotly Express.

show(px.histogram(iris, x = 'Sepal.Width', nbins = 30))

We can control the number of bins with the nbins argument.
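
The binning step behind a histogram can be sketched in a few lines of plain Python. This is an illustration of the idea only, assuming equal-width bins spanning the data range; it is not Plotly's own binning algorithm, which may choose different bin edges. The function name bin_counts and the sample values are hypothetical.

```python
def bin_counts(values, nbins):
    """Count how many values fall into each of nbins equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        # Index of the interval containing v; the maximum goes in the last bin.
        i = min(int((v - lo) / width), nbins - 1)
        counts[i] += 1
    edges = [lo + k * width for k in range(nbins + 1)]
    return edges, counts

# Hypothetical sepal widths, not taken from the iris data set.
widths = [2.0, 2.2, 2.5, 2.8, 3.0, 3.0, 3.1, 3.4, 3.8, 4.0]
edges, counts = bin_counts(widths, nbins=4)

print(edges)   # bin boundaries, plotted along the horizontal axis
print(counts)  # bar heights, plotted along the vertical axis
```

The bar heights always sum to the number of observations, since every value lands in exactly one bin.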

Exercise
Does it make sense to map a categorical variable to the color aesthetic for a histogram? Try changing the command below to map the species column to color.

show(px.histogram(iris, x = 'Sepal.Width', nbins = 30))

Solution. Yes, we can split each bar into multiple colors to visualize the contribution to each bar from each category. This works in Plotly Express:

show(px.histogram(iris, 
                  x = 'Sepal.Width', 
                  nbins = 30, 
                  color = 'Species'))

Density plots

Closely related to the histogram is the one-dimensional density plot. A density plot approximates the distribution of a variable with a smooth curve, rather than with the step function that maps each x value to the height of its histogram bar.

Unfortunately, Plotly Express doesn't have direct support for one-dimensional density plots, so we'll use a Plotly module called the figure factory:

import plotly.figure_factory as ff
show(ff.create_distplot([iris['Sepal.Width']],['Sepal.Width']))

The create_distplot function takes two lists as arguments: one contains the values to use to estimate the density, and the other contains the names of the groups (in this case, we're just using one group). You'll see that the plot produced by this function contains three components: the bar plot is a histogram, the line plot represents the density, and the tick marks indicate the individual variable values (the set of tick marks is called a rug plot).
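
One common way to obtain a smooth density curve is kernel density estimation: place a small Gaussian bump at each observation and average the bumps. The sketch below is a hand-rolled version for illustration, not necessarily the estimator the figure factory uses; the bandwidth of 0.3 and the sample values are arbitrary assumptions.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of sample with Gaussian kernels."""
    n = len(sample)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered at each observation.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample
        )

    return density

# Hypothetical sepal widths, not taken from the iris data set.
sample = [2.0, 2.2, 2.5, 2.8, 3.0, 3.0, 3.1, 3.4, 3.8, 4.0]
density = gaussian_kde(sample, bandwidth=0.3)

# A Riemann sum over a wide grid checks that the curve encloses
# an area of approximately 1, as a density should.
step = 0.01
grid = [i * step for i in range(601)]  # 0.00, 0.01, ..., 6.00
area = sum(density(x) for x in grid) * step

print(density(3.0), density(1.0))
print(area)
```

A smaller bandwidth makes the curve follow individual observations more closely, while a larger one smooths them out.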

If a categorical variable is mapped to the x aesthetic, the point geometry fails to make good use of plot space because all of the points will lie on a limited number of vertical lines. As a result, it's common practice to represent the points in each category in some other way. Examples include the boxplot and the violin plot:

show(px.box(iris, x = 'Species', y = 'Petal.Width'))
show(px.violin(iris, x = 'Species', y = 'Petal.Width'))

The box plot represents the distribution of the y variable using five numbers: the min, first quartile, median, third quartile, and max. Alternatively, the min and max are sometimes replaced with lower and upper fences, and observations which lie outside the fences are considered outliers and depicted with points. The plot creator has discretion regarding how to calculate fence cutoffs, but one common choice for the upper fence formula is $\mathrm{Q}_3 + (1.5 \cdot \mathrm{IQR})$, where $\mathrm{Q}_3$ is the third quartile and $\mathrm{IQR}$ is the interquartile range. The corresponding lower fence formula would be $\mathrm{Q}_1 - (1.5 \cdot \mathrm{IQR})$, that is, the first quartile minus 1.5 times the interquartile range.
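
As a rough sketch of the numbers behind a box plot, the code below computes the quartiles, the interquartile range, and the fences under the common 1.5 times IQR convention (lower fence at the first quartile minus 1.5 times the IQR, upper fence at the third quartile plus 1.5 times the IQR). Quartile conventions vary between tools, so Plotly's values may differ slightly; the function name box_summary and the data are hypothetical.

```python
import statistics

def box_summary(values):
    """Quartiles, fences, and outliers using the 1.5 * IQR convention."""
    # The 'inclusive' method treats the data as the whole population,
    # interpolating between the observed minimum and maximum.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }

# Hypothetical measurements with one unusually large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 20]
summary = box_summary(data)
print(summary)
```

Here the quartiles are 3, 5, and 7, so the fences sit at -3 and 13, and the value 20 is the only observation that would be drawn as an outlier point.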

A violin plot is similar to a boxplot, except that a small density plot is drawn in place of the box-and-whisker figure.

In this section we introduced several of the main tools in a data scientist's visualization toolkit, but you will learn many others. Check out the cheatsheet for ggplot2 to see a much longer list of geometries, aesthetics, and statistical transformations.
