Data analysis

In most cases you don’t need to be able to program to analyse your data. You could do all your analyses in Excel or SPSS.

On the other hand, a Python script can be very useful for repeated analyses. How often do you tweak an analysis and then re-run it?

The scientific python (scipy) eco-system

There are many different libraries for data analysis in Python, and together they form an expansive eco-system of functionality. Google “XXX python” and someone is likely to have implemented whatever you need.

We will discuss five of the core libraries on which all the rest rely:

  • numpy : Fast multi-dimensional arrays, basic numeric operations

  • pandas : Data tables built on top of numpy, with named columns (see the quick sketch after this list)

  • scipy : Many basic scientific functions

  • matplotlib : Visualization

  • ipython : interactive analysis
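
As a quick taste of what “named columns” means in pandas, here is a minimal sketch (the values here are made up purely for illustration):

import numpy as np
import pandas as pd

# wrap a plain numpy array in a DataFrame with named columns
arr = np.array([[0.35, 1], [0.42, 0], [0.28, 1]])
df = pd.DataFrame(arr, columns=['rt', 'corr'])
print(df['rt'])         # select a column by name
print(df['rt'].mean())  # and compute on it directly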

numpy

This library is at the core of the ecosystem: everything else depends on it. It includes an implementation of multi-dimensional arrays with different data-types.

This is a very simple concept, but also a very powerful one!

It also contains some basic mathematical functions that operate on these arrays.

Creating numpy arrays

To create an array, you may pass a sequence of elements into the numpy.array function:

import numpy as np
my_array = np.array([1,2,3])

This object has a shape attribute, that reports on its size:

print(my_array.shape)

And a dtype attribute, that reports on its data-type:

print(my_array.dtype)
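
You can also control the data-type yourself when creating an array; a small illustrative example:

int_array = np.array([1, 2, 3])                 # numpy infers an integer dtype
float_array = np.array([1, 2, 3], dtype=float)  # force a floating-point dtype
print(int_array.dtype, float_array.dtype)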

This doesn’t seem very interesting, but consider a slightly more complicated array:

my_array = np.array([[1, 2, 3], [4, 5, 6]])
print(my_array.shape)

You can automatically create an array with a range of numbers:

first_way = np.arange(0, 10.5, 0.5)
another_way = np.linspace(0, 10, 21)
print(first_way)
print(another_way)

And you can determine the shape you want it to have:

ten_by_ten = np.arange(100).reshape((10, 10))
print(ten_by_ten)

Manipulating numpy arrays

Array arithmetic

Arrays can be used to do math. Let’s demonstrate with a simple array containing integer values:

a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)
(2, 3)

Math between an array and a scalar applies the computation between the scalar and each element of the array:

a2 = a + 2
print(a2)
[[3 4 5]
 [6 7 8]]

Exercise: If you have an array that contains image values, how would you double its contrast?

Math between arrays proceeds element-by-element.

Therefore, it can only be done for arrays with the same shape:

b = np.array([[6, 5, 4], [3, 2, 1]])
c = a + b
print(c)
[[7 7 7]
 [7 7 7]]

This can be done with other binary operators as well:

d = a ** b
print(d)
[[ 1 32 81]
 [64 25  6]]

Many mathematical functions

There are many functions in the numpy name-space to do operations on arrays:

print(np.mean(d))
print(np.mean(d, axis=0))
print(np.sqrt(d))

Array linear algebra

Arrays can be treated as 2-dimensional matrices and we can do matrix operations between them.

Note

We’ve already seen that for addition, because matrix addition is simply element-by-element addition. What about matrix multiplication?

There are a couple of different ways to do this, but the simplest is using the numpy.dot function. This works for linear (1D) arrays:

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([7, 6, 5, 4, 3, 2, 1])
c = np.dot(a,b)

As well as for multi-dimensional arrays:

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
b = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
c = np.dot(a, b)

Can you multiply np.dot(b, a)?

How about if this was the case?:

b = np.array([[1, 2], [3, 4], [5, 6]])

What happens when the array has more than 2 dimensions?
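
If you want to explore those questions yourself, here is a small sketch (purely illustrative) that prints the shapes involved and catches the error numpy raises when the shapes don’t line up:

import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # shape (2, 4)
b = np.array([[1, 2], [3, 4], [5, 6]])      # shape (3, 2), as in the last question
for left, right in [(a, b), (b, a)]:
    try:
        result = np.dot(left, right)
        print(left.shape, "dot", right.shape, "->", result.shape)
    except ValueError as err:  # raised when the inner dimensions do not match
        print(left.shape, "dot", right.shape, "-> error:", err)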

Special kinds of arrays

A few useful kinds of arrays:

ones33 = np.ones([3, 3])
zeros33 = np.zeros([3, 3])
eye33 = np.eye(3)
empty33 = np.empty([3, 3])

Why would you need an empty array?

Logical operations with arrays

Logical operations with arrays produce arrays of boolean dtype:

a = np.array([1, 2, 3, 4, 5, 6, 7])
b = np.array([7, 6, 5, 4, 3, 2, 1])
c = (a==b)
print(c)

This can be used for indexing:

print(a[a==b])
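
Boolean arrays are also handy for selecting or modifying values that meet a condition; a small sketch, re-using the array a from above:

print(a[a > 3])   # select only the elements of a greater than 3

c = a.copy()
c[c > 3] = 0      # set those elements to zero (in a copy, leaving a untouched)
print(c)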

You can’t directly use this as a test though. Try:

if a==b:
   print("a is equal to b")

Why do you think that doesn’t work?

Instead, you can use np.all or np.any:

print(np.any(a==b))
print(np.all(a==b))
print(np.all(a==a))
print(np.any(a==a))

The scipy library

This contains an implementation of many of the core computational routines you need in scientific computing:

  • scipy.cluster : Vector quantization / Kmeans

  • scipy.constants : Physical and mathematical constants

  • scipy.fftpack : Fourier transform

  • scipy.integrate : Integration routines

  • scipy.interpolate : Interpolation

  • scipy.io : Data input and output

  • scipy.linalg : Linear algebra routines

  • scipy.ndimage : n-dimensional image package

  • scipy.odr : Orthogonal distance regression

  • scipy.optimize : Optimization

  • scipy.signal : Signal processing

  • scipy.sparse : Sparse matrices

  • scipy.spatial : Spatial data structures and algorithms

  • scipy.special : Many special mathematical functions

  • scipy.stats : Statistics

Short example

import scipy.fftpack as fft
my_ft = fft.fft(np.random.randn(2,100))
print(my_ft)
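
As another small taste, scipy.stats provides ready-made statistical tests; an illustrative sketch with random data (so the result itself is meaningless):

import numpy as np
from scipy import stats

group1 = np.random.randn(20) + 0.5  # simulated scores for one group
group2 = np.random.randn(20)        # simulated scores for another
t, p = stats.ttest_ind(group1, group2)
print("t = %.3f, p = %.4f" % (t, p))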

Lots of useful examples and study materials can be found in the scipy documentation (https://docs.scipy.org/).

Plotting in matplotlib

Matplotlib is very general and customizable, but for most uses you don’t actually need to interact with all that power. There are several different interfaces to the objects and functions implemented in MPL:

  • matplotlib - raw access to the plotting library; useful for extending matplotlib or doing very custom things

  • pylab - a Matlab-like interface to matplotlib

  • pyplot - an object-oriented interface to matplotlib => use this one!

Here, we will teach you the one true way to use MPL: the object-oriented interface provided through pyplot.

pyplot and subplots

To create a figure and start plotting data:

import matplotlib.pyplot as plt
fig, ax = plt.subplots(1)

The two objects returned from this call are a matplotlib.figure.Figure and a matplotlib.axes.AxesSubplot. Each has multiple methods that you can now use.

For example:

t = np.linspace(-6*np.pi, 6*np.pi, 100)
ax.plot(t, np.sin(t)/t)
plt.show()
[Figure: the resulting plot of sin(t)/t, without axis labels]

Use the object’s ‘setter’ methods to set various attributes of the axes. For example:

ax.set_xlabel('Time')
ax.set_ylabel('Amplitude')
[Figure: the same plot, now with ‘Time’ and ‘Amplitude’ axis labels]
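
Other ‘setter’ methods work the same way; for example, continuing with the same fig and ax (the title text and limits here are just examples):

ax.set_title('sin(t) / t')          # any title string you like
ax.set_xlim(-6 * np.pi, 6 * np.pi)  # fix the x-axis limits explicitly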

Other kinds of plots

The subplots command can return an array of axes:

fig, ax = plt.subplots(2,2)

We can plot different kinds of plots in each one:

x = np.arange(0, 100)
y = np.random.rand(100)  # 100 random numbers
ax[0, 0].hist(y)
ax[0, 1].scatter(x, y)
ax[1, 0].boxplot(y)
ax[1, 1].loglog(x, y)
[Figure: a 2x2 grid of subplots showing a histogram, a scatter plot, a boxplot, and a log-log plot]

Plotting images

We can read images from files into arrays:

img1 = plt.imread('images/lena.png')

Or generate our own 2D arrays:

img2 = np.random.rand(128, 128)
fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.imshow(img1)
ax2.imshow(img2)
ax1.set_axis_off()  # Hide "spines" on first axis
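
By default imshow chooses a colormap for 2D arrays; you can pick one explicitly and add a colorbar if that helps (an optional sketch, continuing from the code above):

im = ax2.imshow(img2, cmap='gray')  # show the random array in grayscale
fig.colorbar(im, ax=ax2)            # add a colorbar showing the value scale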

More resources

Take a look at the Matplotlib gallery for many examples.

Each thumbnail in the gallery links to a page with the source code that created that image. Use the code as a starting point for your own visualizations.

Interactive computing with ipython

The basic functionality of ipython is a more pleasant Python interpreter that runs in the terminal; it has many features that make life easier.

But there are a few other features:

  • Interactive shells (terminal and Qt-based).

  • Interactive notebook.

  • Interactive data visualization.

  • Parallel computing.

Distributions of the scipy ecosystem

There are a couple of easy ways to install all of these things together on any platform: ready-made distributions of the scipy ecosystem.

They each contain a package manager, which will help you install other libraries, and they are free for academic use.
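
Once installed, you can check from within Python that the core libraries are available and which versions you have; a quick sketch:

import numpy, scipy, matplotlib, pandas

print("numpy", numpy.__version__)
print("scipy", scipy.__version__)
print("matplotlib", matplotlib.__version__)
print("pandas", pandas.__version__)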

PsychoPy outputs

Various options:

  • csv file for the experiment (one trial per line)

  • csv or excel ‘summaries’ format (can only be saved by individual loops, not the ExperimentHandler). Not really recommended any more.

  • log files (not for analysis though)

  • psydat files (literally a saved copy of the Experiment/TrialHandler)

      • can save fresh copies of the csv file from here too

      • in future should be able to export a copy of the original script that collected the data as well

Find your files

There are two simple ways to find files. You could search for all the files matching a pattern in a folder using a module called glob:

import glob
filenames = glob.glob("data/*.csv")

Or you could open a file dialog using PsychoPy:

from psychopy import gui
filenames = gui.fileOpenDlg(allowed="*.csv")

Either way, you get a list of file paths (which might be empty).

It’s also useful to have something to manipulate filenames. We’ll use os.path for that:

from os import path
from psychopy import gui
filenames = gui.fileOpenDlg(allowed="*.csv")
for thisFilename in filenames:
    thisPath, thisFullName = path.split(thisFilename)
    fileNoExt, fileExt = path.splitext(thisFilename)
    print(thisPath, thisFullName)
    print(fileNoExt, fileExt)

Load a csv file

We’ll use a similar trick to the last one, but this time with another library called pandas (http://pandas.pydata.org/):

from os import path
from psychopy import gui
import pandas as pd

filenames = gui.fileOpenDlg(allowed="*.csv")
for thisFilename in filenames:
    print(thisFilename)
    thisDat = pd.read_csv(thisFilename)
    print(thisDat)

Boom! We’ve got our data, as easily as that!

But, really, how easily can we use that data? Let’s try to pull out just the reaction times:

for thisFilename in filenames:
    print(thisFilename)
    thisDat = pd.read_csv(thisFilename)
    print(thisDat['rt'])

We can also select parts of the data that fulfill certain criteria. Let’s get rid of trials where rt>1.0 (not ready?) and corr==0:

for thisFilename in filenames:
    print(thisFilename)
    thisDat = pd.read_csv(thisFilename)
    # filter out bad data
    filtered = thisDat[thisDat['rt'] <= 1.0]
    filtered = filtered[filtered['corr'] == 1]
    print(filtered['rt'])

OK, from our filtered data we need the mean and std.dev. reaction time for each condition:

import numpy as np
from scipy import stats
...

conflict = filtered[filtered.descr == 'conflict']
congruent = filtered[filtered.descr != 'conflict']
# get mean/std.dev (ddof=1 gives the sample std.dev., i.e. dividing by N-1)
meanConfl = np.mean(conflict['rt'])
sdConfl = np.std(conflict['rt'], ddof=1)
meanCongr = np.mean(congruent['rt'])
sdCongr = np.std(congruent['rt'], ddof=1)
print("Conflict = %.3f (sd=%.3f)" % (meanConfl, sdConfl))
print("Congruent = %.3f (sd=%.3f)" % (meanCongr, sdCongr))

Yes, but, I mean really, is that significant?:

t, p = stats.ttest_ind(conflict['rt'], congruent['rt'])
print("Independent samples t-test: t=%.3f, p=%.4f" % (t, p))

(Note that this is doing the statistics for a single participant, not the stats across a group).
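
If you have one csv file per participant, a group-level analysis could collect one mean per condition per participant and then run a paired test across participants. A minimal sketch, assuming the same 'rt', 'corr' and 'descr' columns and the same filtering as above (the file location is hypothetical):

import glob
import pandas as pd
from scipy import stats

meansConfl = []
meansCongr = []
for thisFilename in glob.glob("data/*.csv"):  # one file per participant (assumed)
    dat = pd.read_csv(thisFilename)
    dat = dat[(dat['rt'] <= 1.0) & (dat['corr'] == 1)]  # same filtering as above
    meansConfl.append(dat[dat.descr == 'conflict']['rt'].mean())
    meansCongr.append(dat[dat.descr != 'conflict']['rt'].mean())

t, p = stats.ttest_rel(meansConfl, meansCongr)  # paired: same participants in both conditions
print("Paired t-test across participants: t=%.3f, p=%.4f" % (t, p))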

Working copies of those files

You can fetch working copies of those analysis scripts, and data files to test them on, from here:

https://github.com/psychopy/posner

Load a psydat file

You could open this for analysis, but for now let’s use it to try saving out fresh copies of the data (to make sure your undergrads didn’t fudge their results?!):

from psychopy import misc
dat = misc.fromFile(someFileNameHere)
dat.saveAsWideText(newFileNameHere)

OK, let’s open a set of psydat files and output new copies of the csv file:

from os import path
from psychopy import misc, gui
filenames = gui.fileOpenDlg(allowed="*.psydat")
for thisFilename in filenames:
    fileNoExt, fileExt = path.splitext(thisFilename)
    newName = fileNoExt+"NEW.csv"
    dat = misc.fromFile(thisFilename)
    dat.saveAsWideText(newName)
    print('saved', newName)

Plotting the Posner experiment data

Let’s apply our knowledge to the data from the Posner experiment:

fig, ax = plt.subplots(1)
ax.bar([1,2], [meanConfl, meanCongr], yerr=[sdConfl, sdCongr])
plt.show()
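
To make the bars easier to read, you can label them with their condition names before calling plt.show() (a small optional addition; the y-axis unit is an assumption):

ax.set_xticks([1, 2])
ax.set_xticklabels(['conflict', 'congruent'])
ax.set_ylabel('Mean RT (s)')  # assuming the reaction times are in seconds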

If you want to save the Figure:

fig.savefig('my_figure.png')
fig.savefig('my_figure.pdf')
