Working With Data in Python

1. Working with Data in Python

For a long time, data handling and graph production in Python lagged behind R. Data were collected in comma- or tab-separated text files (CSV, TSV) or, often, in pickle files. Pickle is Python's binary serialization format: objects are written out as bytes and have to be loaded back into Python before they can be read. These issues made it harder to use Python interactively for data analysis, and many workers in psychology would collect their data with experiments coded in Python (or other languages) and then use R for statistical testing.

However, over the last few years the scientific Python community has expanded its options for both graphing and data handling markedly, and working entirely in Python is increasingly a viable option. The most widely used Python library for handling data is Pandas. In many respects it has a similar look and feel to the data.frame style of data analysis used in R. However, there are definite syntactic and implementation differences. While a familiarity with R and data.frames is a good base for beginning to use Pandas, there will remain a lot of concrete specifics to learn to use Python effectively. In this topic a few of the pandas basics are demonstrated. In addition to the illustrations here, there are numerous online tutorials and blogs that explain pandas in more detail, and with many more examples.

Ia : Intro to Using Pandas (i2c4p) from Britt Anderson on Vimeo.

This module gives an overview of using python to request data via the internet; unzip a data file; and load that data into python with pandas for exploratory analysis.

1.1. Some additional sources

1.1.1. Pandas Books

1.2. Data handling in Python

Ib : Installing Pandas (i2c4p) from Britt Anderson on Vimeo.

There are several methods for installing Pandas. In this video I demonstrate one of them: using pip.

1.2.1. Pandas

How to get Pandas?

  1. Anaconda (Conda): you will see this in a lot of documentation. This is a collection of packages for python that are curated to work well together and that allow you to set up virtual environments (which protect your projects and their differing requirements from conflicting with each other). Using conda is not hard, but it is another tool, and it has its own learning curve, so it will not be used here. If you find yourself using python regularly and you are comfortable with how to install and use software and packages, then I encourage you to read more about Anaconda and decide if it is a python solution that will work for you.
  2. Xubuntu stable: sudo apt-get install python3-pandas
  3. Pip (PyPI): pip3 install pandas

To try and get all our python packages from one source (and thereby reduce our chance for conflicts) we will use the pip method as that will work well with our subsequent installation of the psychopy package.

1.2.2. Interactive Pandas

Python is an interpreted language. You can use Python from an interpreter much as you used R in an R console (or RStudio).

1.2.3. Getting Your Data Into Python

Ic : Using Python to Request Data from the Internet (i2c4p) from Britt Anderson on Vimeo.

This video demonstrates some of the steps below for getting data with Python. It uses the requests library.

Before you can get data into python, you have to get data.

There are a large number of sources for data online that you can use to explore different tools and analyses.

  1. Open Psychometrics
  2. Data Links from the APA
  3. Data World
  4. The Kaggle Competition: Kaggle competitions allow you to make data analysis a competitive sport.

For demonstration purposes we will get data from the Open Psychometrics project (I am using the humor styles questionnaire). You could just go to the website and right click and download, but you could also use python to get the data. There are many methods for this and the one used here came from this online post.

You will have to run pip3 install requests to get the requests library.

Then you will want to be in the directory where you want the file to download. I am using a :session: in order to allow variables to persist across babel code blocks.

import requests
url = "http://openpsychometrics.org/_rawdata/HSQ.zip"
r = requests.get(url)
r.raise_for_status() # stop here if the download failed (for example a 404)
filename = url.split('/')[-1] # -1 gets us the last item of a list, in this case the filename
with open(filename,'wb') as output_file:
    output_file.write(r.content)
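
The [-1] index in the snippet above picks the last element of the list returned by split. A minimal sketch of that step in isolation follows. The alternative using urllib.parse and pathlib is not part of the original method; it is shown only as one option that would also ignore a query string, should a URL have one.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://openpsychometrics.org/_rawdata/HSQ.zip"

# Splitting on "/" gives a list of pieces; index -1 is the last one.
pieces = url.split("/")
filename = pieces[-1]

# Alternative (an assumption, not the method used above): parse the URL first,
# so that a query string such as "?download=1" would not end up in the name.
filename_alt = PurePosixPath(urlparse(url).path).name

print(pieces)
print(filename, filename_alt)
```

Neither version touches the network, so it is safe to try at the interpreter before running the actual download.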

Id : How to Use Python Unzip functions (i2c4p) from Britt Anderson on Vimeo.

Using the python zipfile library.

This file is a "zip" file. For a zip file the data have been compressed to save space. There are many other compression types available. Tar is fairly common. Rar is rare. Python also has facilities to unzip files. Why bother using all these tools rather than doing everything manually? Well, for one thing you could script it. Why waste your time clicking random buttons? Let the computer do the work.

import zipfile
with zipfile.ZipFile('./HSQ.zip', 'r') as zip_ref:
    zip_ref.extractall('.') # what does "." mean in this context?
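
Before extracting, it can help to look at what an archive contains. The sketch below assumes nothing about the real HSQ.zip: it builds a throwaway archive in a temporary directory (the names demo.zip and demo/data.csv are made up for illustration), lists its members with namelist, and then extracts them.

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    archive = tmp_path / "demo.zip"  # hypothetical archive name

    # Create a tiny zip file to stand in for a downloaded one.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("demo/data.csv", "Q1,Q2\n1,2\n3,4\n")

    # Inspect the contents, then extract everything next to the archive.
    with zipfile.ZipFile(archive, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(tmp_path)

    extracted_text = (tmp_path / "demo" / "data.csv").read_text()

print(members)
```

Listing the members first tells you which folder and file names to expect after extraction, which matters for the path you later hand to pandas.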

Ie : Getting Data into Pandas (i2c4p) from Britt Anderson on Vimeo.

Reading in a csv data file to a python interactive session with Pandas.

You have to import pandas to use it.

import pandas as pd
dpd = pd.read_csv("./HSQ/data.csv")
dpd.columns.values
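
Once a file is loaded, a few attributes and methods give a quick overview of the resulting data frame. The sketch below uses a small made-up CSV held in a string (via io.StringIO) so that it runs without the HSQ download; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# A made-up CSV standing in for ./HSQ/data.csv.
csv_text = """Q1,Q2,age,gender
2,3,25,1
4,1,31,2
5,2,19,3
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

column_names = list(df.columns)  # column labels as a plain list
n_rows, n_cols = df.shape        # (rows, columns)

print(column_names)
print(n_rows, n_cols)
print(df.head(2))                # first two rows
```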

1.2.4. Repeating Things We Did In R With Pandas

  1. The length of a list

    In R you would use the length command, but in python it is len. Almost every language you will program in will have a command for finding the length, but the actual word may be different or the syntax may be different.

    len(dpd['Q1'])
    
  2. Using a Conditional

    In R we did things like mydataframe$mydatacol to get a column of data from a data frame. In python the format looks more like a python dictionary.

    dpdmg = dpd.copy()
    dpdmg = dpdmg[dpdmg['gender'].isin([1,2])]
    len(dpdmg['Q1'])
    

    What happens if you apply isin to the whole data frame rather than to one column, for example dpd[dpd.isin([1,2])]? You keep the same number of rows, because the ineligible values are replaced with NaN (not a number).
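
The counting and selection steps above can be tried on a small made-up data frame (the values are invented, and gender is coded 1, 2, 3 purely for illustration). Indexing with a mask built from one column drops rows; indexing with a mask built from the whole data frame keeps every row and fills the non-matching cells with NaN.

```python
import pandas as pd

df = pd.DataFrame({"Q1": [2, 4, 5, 1], "gender": [1, 2, 3, 1]})

# Mask from a single column: one True/False per row, so rows are dropped.
row_mask = df["gender"].isin([1, 2])
kept = df[row_mask]

# Mask from the whole frame: one True/False per cell, so the shape is kept
# and cells that do not match become NaN.
masked = df[df.isin([1, 2])]

print(len(df), len(kept), len(masked))
print(masked)
```

Comparing len before and after a selection is a quick check that the filter did what you intended.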

1.2.5. Functional Styles versus Object Orientation

Python is an object oriented language. R is in a more eclectic style that reflects its LISP origins. Object oriented languages have data structures, called objects, that encapsulate both attributes (what objects are like) and methods (what objects can do). A list would have its contents, the items in the list, as its attributes, but would also have the ability, a method, to report the length of its self [1]. The attributes and methods of a python object are accessed by a name that includes a dot '.', like the ".isin" you see in the code snippet above; the dot shows you we are accessing either an attribute or a method of an object. Pandas creates a DataFrame object (the name emphasizes its R heritage), but it is not the same thing as an R data.frame, and while most operations are achievable in either, the commands are not the same.
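
A short sketch of the distinction between attributes and methods, using a plain list and a tiny made-up data frame. Attributes are read without parentheses; methods are called with them.

```python
import pandas as pd

items = [3, 1, 2]
df = pd.DataFrame({"Q1": [2, 4, 5], "gender": [1, 2, 1]})

# Methods: things an object can do, called with parentheses.
items.sort()                       # a list can sort itself in place
in_group = df["gender"].isin([1])  # a column can test membership

# Attributes: what an object is like, read without parentheses.
shape = df.shape                   # (rows, columns)
labels = list(df.columns)

# len() is a function, but it works by calling the object's __len__ method.
same_length = len(items) == items.__len__()

print(items, shape, labels, list(in_group), same_length)
```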

2. Assessing Your Use of Pandas

2.1. Task

You will

  1. Download a data set in python
  2. Unzip the data set
  3. Tell me the names of all the columns
  4. Tell me the length of one of the rows
  5. And tell me the mean [2] of one of the columns subset by rows [3].
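
For the last step, one possible approach (a sketch on invented data, not a required method) is to group the rows by one column and take the mean of another with groupby. The column names height and gender, and all of the values, are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "height": [170.0, 160.0, 180.0, 150.0],
        "gender": ["m", "f", "m", "f"],
    }
)

# Mean height over all rows.
overall_mean = df["height"].mean()

# Mean height within each subset of rows that share a gender value.
group_means = df.groupby("gender")["height"].mean()

print(overall_mean)
print(group_means)
```

The same result could be reached by selecting rows with a conditional first and then calling mean on the column; groupby simply does this for every category at once.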

2.2. Requirements

Your submission will be either an org [4] file with code blocks that let me execute each of these steps by myself interactively, one at a time, or a python script that I can run from the command line and that performs each of these steps, displaying its output to stdout. Stdout (standard output) is the stream a program writes its normal output to; in a terminal it is printed to the screen. For example, when you run ls to list the names of files in a directory, the listing is written to stdout.

2.3. Comments

Don't start by writing this as a script. That will almost certainly be too hard. Begin by doing this interactively, perhaps at a python interpreter, until you get the code for a particular step done correctly. Then copy that code to your file and test that it still works, and that nothing got messed up in the copying and pasting. Once you have the data in your interpreter, don't just rush through. Take some time to play with various pandas commands or exercises, such as examples you can find here, to generally grow more familiar with the package.

Footnotes:

[1] Self is in italics because this is the special name one often sees in object oriented code where the definition of an object is given. Self typically refers to the particular instance of the object that a method is acting on.

[2] Or some other simple statistic appropriate to the data, e.g. if your data was categorical you might give me the number of rows for each type of category.

[3] An example of this is that if you had columns for heights and genders you could output the average height of men, women, and other.

[4] To make some nicer formatting and help available when using python from within emacs, check out the elpy package.

Ec : Installing the Elpy package (i2c4p) from Britt Anderson on Vimeo.

Installing the elpy package. A bit about installing packages in emacs generally, and a specific package to help with python code formatting and syntax.

Date: 2023-02-09 Thu 00:00

Author: Britt Anderson

Created: 2023-04-23 Sun 11:53
