
Learning Tree International : Using a Pandas Andrews Curve Plot for Multidimensional Data

10/12/2021 | 11:52am EST

Who doesn't love a scatterplot? Clear and informative, the scatterplot is often one of the very first plots used when starting an exploratory data analysis. Unfortunately, scatterplots are limited to two dimensions, three if you're using a package that supports 3D plots. This means we can look at the relationships between different columns of data two or three at a time. No more. It is entirely possible that no matter which two-dimensional subset of data we view, important relationships may be obscured from our view. This is particularly true if we are looking for clusters within our datapoints. Clusters might be apparent to our eyes only if we had magical six-dimensional vision.

Pandas implements a small set of plotting functions specifically designed for viewing multidimensional data on a two-dimensional computer monitor. Today we will look at one of them, the Andrews curves plot, as this is a plot that is not easy to interpret the first time you see one.

In a conventional scatterplot, each datapoint is represented by a point in our plot, though they might be rendered as circles, squares or triangles. In an Andrews plot, individual datapoints become curves. These curves are essentially a finite Fourier series. Each column value for an individual multidimensional datapoint becomes a coefficient in the Fourier series. There are two very important consequences of this.

1) The shape of a curve depends on the sequence in which the columns are added. If we plot exactly the same data but with two different orders for the columns, we will get two different plots. This means that the actual shape of an Andrews curve tells us nothing.

2) We interpret an Andrews plot by examining the similarities and differences among the shapes of curves in our data. The shape of a curve means nothing, but the similarity of that shape to other curves tells us a great deal about the similarity of the datapoints.
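Both points can be checked numerically. The sketch below is not the pandas source code; it is a minimal NumPy version of the series that the pandas documentation gives for Andrews curves, f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ..., evaluated for t between -pi and pi. The helper names (andrews_curve, curve_distance_sq) and the two sample datapoints are our own illustrative choices.

```python
import numpy as np

def andrews_curve(row, t):
    """Evaluate the Andrews curve of one datapoint (a sequence of column values) at angles t."""
    row = np.asarray(row, dtype=float)
    t = np.asarray(t, dtype=float)
    # The first column gives the constant term x1 / sqrt(2).
    curve = np.full_like(t, row[0] / np.sqrt(2.0))
    # The remaining columns alternate: sin(t), cos(t), sin(2t), cos(2t), ...
    for i, coeff in enumerate(row[1:], start=1):
        harmonic = (i + 1) // 2
        wave = np.sin if i % 2 == 1 else np.cos
        curve += coeff * wave(harmonic * t)
    return curve

# Evenly spaced angles over one period; the endpoint is dropped so that a
# plain mean over the grid approximates the integral over [-pi, pi].
t = np.linspace(-np.pi, np.pi, 400, endpoint=False)

a = [5.1, 3.5, 1.4, 0.2]   # illustrative four-column datapoint
b = [7.0, 3.2, 4.7, 1.4]   # a second illustrative datapoint
order = [1, 3, 0, 2]       # an arbitrary reordering of the columns

a_reordered = [a[i] for i in order]
b_reordered = [b[i] for i in order]

# Point 1: reordering the columns changes the shape of the curve.
shape_changed = not np.allclose(andrews_curve(a, t), andrews_curve(a_reordered, t))

def curve_distance_sq(x, y):
    """Approximate the integral over [-pi, pi] of the squared difference of two curves."""
    diff = andrews_curve(x, t) - andrews_curve(y, t)
    return 2 * np.pi * np.mean(diff ** 2)

# Point 2: the distance between two curves tracks the distance between the datapoints.
d_original = curve_distance_sq(a, b)
d_reordered = curve_distance_sq(a_reordered, b_reordered)
d_expected = np.pi * np.sum((np.array(a) - np.array(b)) ** 2)

print(shape_changed, d_original, d_reordered, d_expected)
```

Because the terms of the series are orthogonal over one period, the squared distance between two curves works out to pi times the squared Euclidean distance between the two datapoints, whatever the column order. That is why the similarity of curves is meaningful even though the shape of any single curve is not.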

Let's look at some examples.

We can start, as almost everyone does, with the iris dataset. The andrews_curves() function is defined in the pandas.plotting module, and it is particularly easy to use. We wish to look at the numerical data columns grouped by Species.

pd.plotting.andrews_curves(dfIris, 'Species')


As mentioned, every curve represents a single datapoint from the original dataset. We see that one of the three bundles of curves, setosa, is distinct from the other two; there is considerable overlap between the versicolor and virginica curves. This is consistent with many other views of the iris data and suggests that the iris data will not break down nicely into three clusters based solely on its four columns of data.
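The call above assumes a DataFrame named dfIris is already loaded. For readers who want something that runs on its own, here is a minimal sketch using a hypothetical stand-in, dfSmall, with a handful of hand-typed rows and the same column names used in this article. It selects a non-interactive matplotlib backend so it can run in a plain script, and it counts the plotted lines to confirm that there is one curve per datapoint.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical stand-in for dfIris: a few hand-typed rows, not the full dataset.
dfSmall = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "Sepal.Width":  [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "Petal.Width":  [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "Species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# andrews_curves returns the matplotlib Axes it drew on.
ax = pd.plotting.andrews_curves(dfSmall, "Species")

# One line is drawn per row of data.
n_curves = len(ax.get_lines())
# Each curve is evaluated at 200 points, the default for the samples parameter.
n_points = len(ax.get_lines()[0].get_xdata())
print(n_curves, n_points)
```

In a notebook the plot renders inline; in a script, the figure can be written to disk with ax.figure.savefig and a filename of your choice.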

Running another plot on the same data but with different column orders we see that the plot looks different, but our conclusion does not change.

test = dfIris[["Sepal.Width", "Petal.Width", "Sepal.Length", "Petal.Length", "Species"]]

pd.plotting.andrews_curves(test, 'Species')


We see the shapes are different but setosa still appears distinct while there is substantial overlap between versicolor and virginica.

Since the whole point of Andrews curves is to help make sense of multidimensional data, let's look at a sample dataset with six dimensions rather than iris' four.

The Swiss Bank Note sample dataset contains 200 rows of data representing the physical dimensions of real and counterfeit banknotes. As before, we can plot the Andrews curves.

pd.plotting.andrews_curves(dfBankNotes, "Type", color=["red", "black"])


We clearly see that the shapes of the curves are virtually identical, but the position of the "counterfeit" line is shifted, suggesting that the two groups have distinct characteristics.

The plot also illustrates a potential problem, one shared with the time-honored scatterplot. If there are too many datapoints, there may be so much overlap among them that individual datapoints are not discernible. We may wish to plot a subset of the initial data.

pd.plotting.andrews_curves(dfBankNotes.sample(n=20), "Type", color=["red", "black"])


By plotting fewer points, we accentuate the fact that each curve represents a datapoint.
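A plain random sample can draw unevenly from the two groups, and it changes on every run. If that matters, one option is to sample within each group and fix the seed. The sketch below assumes a hypothetical stand-in for dfBankNotes, named dfNotes, with a single made-up measurement column and made-up class labels; only the "Type" column name comes from the calls above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for dfBankNotes: 200 rows, one made-up column, and a "Type" label.
rng = np.random.default_rng(0)
dfNotes = pd.DataFrame({
    "Length": rng.normal(215.0, 0.4, size=200),
    "Type": ["genuine"] * 100 + ["counterfeit"] * 100,
})

# Draw 10 rows from each group; random_state makes the draw repeatable.
subset = dfNotes.groupby("Type").sample(n=10, random_state=1)

# Sort by the class column so the classes appear in a fixed order on every run.
subset = subset.sort_values("Type")

counts = subset["Type"].value_counts().to_dict()
print(counts)
```

The sort is a precaution. To our understanding, pandas pairs the entries of the color list with the classes in the order they first appear in the frame, so an unsorted random sample could swap which group is drawn in red and which in black. The resulting subset can be passed to pd.plotting.andrews_curves in the same way as the full frame.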


Andrews curves, as implemented by Pandas, add another potentially useful tool for exploratory data analysis, particularly if we are curious about the similarity of groups within the data.

If you would like to have fun with the Swiss Bank Notes sample dataset, you can download it here.


Learning Tree International Inc. published this content on 12 October 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 12 October 2021 15:51:04 UTC.

© Publicnow 2021