Pandas Data Cleaning Cheat Sheet

30-04-2021 admin

Being able to look up and use functions fast allows us to achieve a certain flow when writing code. So I’ve created this cheat sheet of functions from Python pandas.

This is not a comprehensive list but contains the functions I use most, an example, and my insights as to when it’s most useful.

Load CSV

If you want to run these examples yourself, download the Anime recommendation dataset from Kaggle, unzip it and drop it in the same folder as your Jupyter notebook.

Next, run these commands and you should be able to replicate my results for any of the functions below.

Convert a CSV directly into a data frame. Sometimes loading data from a CSV also requires specifying an encoding (e.g. encoding='ISO-8859-1'). It’s the first thing you should try if your data frame contains unreadable characters.

A similar function, pd.read_excel, exists for Excel files.
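A minimal sketch of the load step. The CSV text here is made up and read from memory so the example runs without the Kaggle files; the commented line shows the same call against a file on disk, where the file name is an assumption.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real anime.csv
csv_text = "anime_id,name,type,rating\n1,Show A,TV,8.5\n2,Show B,Movie,7.0\n"

# read_csv accepts a file path or any file-like object
anime = pd.read_csv(io.StringIO(csv_text))

# With a file on disk, pass an encoding if characters look wrong:
# anime = pd.read_csv("anime.csv", encoding="ISO-8859-1")
print(anime)
```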

Build data frame from inputted data

Useful when you want to manually instantiate simple data so that you can see how it changes as it flows through a pipeline.
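One way to do this is to pass a dictionary of column names to lists. The values below are invented for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each list holds that column's values
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Bob", "Sally", "Scott"],
})
print(df)
```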

Copy a data frame

Useful when you want to make changes to a data frame while maintaining a copy of the original. It’s good practise to copy all data frames immediately after loading them.
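A short sketch showing that edits to the copy leave the original alone. The data frame is a made-up stand-in for the anime data.

```python
import pandas as pd

anime = pd.DataFrame({"name": ["Show A", "Show B"], "rating": [8.5, 7.0]})

# copy() makes a deep copy by default, so the two frames share no data
anime_copy = anime.copy()
anime_copy.loc[0, "rating"] = 0.0

print(anime.loc[0, "rating"])       # original value is unchanged
print(anime_copy.loc[0, "rating"])  # only the copy was modified
```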

Save to CSV

This dumps to the same directory as the notebook. I’m only saving the 1st 5 rows below but you don’t need to do that. Again, df.to_excel() also exists and functions basically the same for excel files.
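A sketch of saving the first 5 rows. It writes to a temporary folder so the example cleans up after itself; passing a bare file name instead would write next to the notebook. The file name and data are assumptions.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with 10 rows
anime = pd.DataFrame({
    "anime_id": range(1, 11),
    "rating": [float(i) for i in range(1, 11)],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "anime_first5.csv")
    # index=False leaves the row index out of the file
    anime.head(5).to_csv(path, index=False)
    saved = pd.read_csv(path)

print(saved)
```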

Get top or bottom n records

Display the first n records from a data frame. I often print the top record of a data frame somewhere in my notebook so I can refer back to it if I forget what’s inside.
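For example, with a small made-up frame, head(n) returns the first n rows and tail(n) the last n:

```python
import pandas as pd

anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "name": ["Show A", "Show B", "Show C", "Show D"],
})

first_three = anime.head(3)  # first 3 rows
last_one = anime.tail(1)     # last row
print(first_three)
print(last_one)
```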

Count rows

This is not a pandas function, but len() counts rows and can be saved to a variable and used elsewhere.

Count unique rows

Count unique values in a column:
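A quick sketch of both counts on a made-up frame: len() for the number of rows and nunique() for the number of distinct values in one column.

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C", "Show D"],
    "type": ["TV", "Movie", "TV", "OVA"],
})

n_rows = len(anime)                # number of rows
n_types = anime["type"].nunique()  # number of distinct values in 'type'
print(n_rows, n_types)
```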

Get data frame info

Useful for getting some general information like header, number of values and datatype by column. A similar but less useful function is df.dtypes which just gives column data types.
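As a sketch on made-up data, info() prints its summary directly, while dtypes returns a Series you can index by column name:

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B"],
    "rating": [8.5, 7.0],
})

anime.info()             # prints columns, non-null counts and dtypes
dtypes = anime.dtypes    # Series mapping column name to dtype
rating_dtype = str(dtypes["rating"])
print(rating_dtype)
```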

Get statistics

Really useful if the data frame has a lot of numeric values. Knowing the mean, min and max of the rating column gives us a sense of how the data frame looks overall.
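A small sketch with invented ratings. describe() returns a data frame, so individual statistics can be pulled out with loc.

```python
import pandas as pd

anime = pd.DataFrame({
    "rating": [8.5, 7.0, 9.0, 6.5],
    "members": [1000, 500, 2000, 100],
})

stats = anime.describe()  # count, mean, std, min, quartiles, max per numeric column
mean_rating = stats.loc["mean", "rating"]
print(stats)
```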

Get counts of values
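Count how often each value appears in a column. A minimal sketch on made-up data, using value_counts():

```python
import pandas as pd

anime = pd.DataFrame({"type": ["TV", "Movie", "TV", "OVA"]})

# Returns a Series of counts, most frequent value first
counts = anime["type"].value_counts()
print(counts)
```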

Get a list or series of values for a column

This works if you need to pull the values in columns into x and y variables so you can fit a machine learning model.
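A sketch of both forms on invented data: .values gives a NumPy array and .tolist() gives a plain Python list.

```python
import pandas as pd

anime = pd.DataFrame({"rating": [8.5, 7.0, 9.0]})

rating_array = anime["rating"].values   # NumPy array
rating_list = anime["rating"].tolist()  # plain Python list
print(rating_array, rating_list)
```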

Get a list of index values

Create a list of values from index.
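For instance, on a made-up frame with the default integer index:

```python
import pandas as pd

anime = pd.DataFrame({"name": ["Show A", "Show B", "Show C"]})

index_values = anime.index.tolist()  # list of index labels
print(index_values)
```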

Get a list of column values
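Create a list of column names. A minimal sketch on made-up data:

```python
import pandas as pd

anime = pd.DataFrame({"name": ["Show A"], "type": ["TV"], "rating": [8.5]})

column_names = anime.columns.tolist()  # list of column labels
print(column_names)
```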

Append new column with a set value

I do this on occasion when I have test and train sets in 2 separate data frames and want to mark which rows are related to what set before combining them.
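A sketch of the idea, with a hypothetical column name 'train' and invented rows. Assigning a single value fills the whole column.

```python
import pandas as pd

train_df = pd.DataFrame({"name": ["Show A", "Show B"]})
test_df = pd.DataFrame({"name": ["Show C"]})

# A scalar is broadcast to every row of the new column
train_df["train"] = 1
test_df["train"] = 0
print(train_df)
print(test_df)
```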

Create new data frame from a subset of columns

Useful when you only want to keep a few columns from a giant data frame and don’t want to specify each that you want to drop.
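For example, passing a list of column names keeps just those columns (made-up data):

```python
import pandas as pd

anime = pd.DataFrame({
    "anime_id": [1, 2],
    "name": ["Show A", "Show B"],
    "rating": [8.5, 7.0],
    "members": [1000, 500],
})

# Double brackets: the inner list names the columns to keep
subset = anime[["name", "rating"]]
print(subset)
```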

Drop specified columns

Useful when you only need to drop a few columns. Otherwise, it can be tedious to write them all out and I prefer the previous option.
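A minimal sketch on made-up data. drop() returns a new data frame and leaves the original as it was.

```python
import pandas as pd

anime = pd.DataFrame({
    "anime_id": [1, 2],
    "name": ["Show A", "Show B"],
    "rating": [8.5, 7.0],
    "members": [1000, 500],
})

trimmed = anime.drop(columns=["anime_id", "members"])
print(trimmed)
```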

Add a row with sum of other rows

We’ll manually create a small data frame here because it’s easier to look at. The interesting part here is df.sum(axis=0), which adds the values down the rows and gives one total per column. Alternatively, df.sum(axis=1) adds values across the columns and gives one total per row.

The same logic applies when calculating counts or means, ie: df.mean(axis=0).
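A sketch with invented numbers. Older tutorials append the totals with df.append, which was removed in pandas 2.0, so this version uses pd.concat instead.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

col_totals = df.sum(axis=0)  # one total per column
row_totals = df.sum(axis=1)  # one total per row

# Turn the totals into a one-row frame and add it to the bottom
df_with_total = pd.concat([df, col_totals.to_frame().T], ignore_index=True)
print(df_with_total)
```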

Concatenate 2 dataframes

Use this if you have 2 data frames with the same columns and want to combine them.

Here we split a data frame in 2, then add the halves back together.
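A minimal sketch of that split and recombine on made-up data:

```python
import pandas as pd

anime = pd.DataFrame({
    "anime_id": [1, 2, 3, 4],
    "name": ["Show A", "Show B", "Show C", "Show D"],
})

df1 = anime.iloc[:2]  # first two rows
df2 = anime.iloc[2:]  # remaining rows

# ignore_index=True renumbers the rows 0..n-1 in the result
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```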

Merge dataframes

With how='left', this functions like a SQL left join: use it when you have 2 data frames and want to join on a column. Note that merge defaults to an inner join if how is not given.
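A sketch of a left join on a shared 'anime_id' column. Both frames are invented, and the 99 id is deliberately missing from the right-hand frame to show what happens to unmatched rows.

```python
import pandas as pd

rating = pd.DataFrame({
    "user_id": [1, 1, 2],
    "anime_id": [1, 3, 99],
    "score": [10, 8, 7],
})
anime = pd.DataFrame({
    "anime_id": [1, 2, 3],
    "name": ["Show A", "Show B", "Show C"],
})

# how="left" keeps every row of `rating`; unmatched rows get NaN for `name`
merged = rating.merge(anime, on="anime_id", how="left")
print(merged)
```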

Retrieve rows with matching index values

The index values in anime_modified are the names of the anime. Notice how we’ve used those names to grab specific rows.
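A minimal sketch, assuming anime_modified is the anime frame with 'name' set as its index (the rows here are made up):

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C"],
    "rating": [8.5, 7.0, 9.0],
})
anime_modified = anime.set_index("name")

# loc selects by index label; a list of labels returns those rows
picked = anime_modified.loc[["Show A", "Show C"]]
print(picked)
```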

Retrieve rows by numbered index values

This differs from the previous function. Using iloc, the 1st row has an index of 0, the 2nd row has an index of 1, and so on… even if you’ve modified the data frame and are now using string values in the index column.

Use this if you want the first 3 rows in a data frame.
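For example, on a made-up frame indexed by name, iloc still selects by position:

```python
import pandas as pd

anime_modified = pd.DataFrame(
    {"rating": [8.5, 7.0, 9.0, 6.5]},
    index=["Show A", "Show B", "Show C", "Show D"],
)

# Positions 0, 1 and 2, regardless of the string labels in the index
first_three = anime_modified.iloc[0:3]
print(first_three)
```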

Get rows

Retrieve rows where a column’s value is in a given list. anime[anime['type'] == 'TV'] also works when matching on a single value.
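A sketch of both forms on made-up data: isin() for a list of values and == for a single value.

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C", "Show D"],
    "type": ["TV", "Movie", "TV", "OVA"],
})

tv_or_movie = anime[anime["type"].isin(["TV", "Movie"])]  # match any value in the list
tv_only = anime[anime["type"] == "TV"]                    # match a single value
print(tv_or_movie)
print(tv_only)
```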

Slice a dataframe

This is just like slicing a list. Slice a data frame to get all rows before/between/after specified indices.
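A short sketch on a made-up frame. As with lists, the end position is excluded.

```python
import pandas as pd

anime = pd.DataFrame({"name": ["Show A", "Show B", "Show C", "Show D"]})

before = anime[:2]    # rows at positions 0 and 1
between = anime[1:3]  # rows at positions 1 and 2
after = anime[2:]     # rows from position 2 onward
print(between)
```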

Filter by value

Filter data frame for rows that meet a condition. Note this maintains existing index values.
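For instance, keeping only rows with a rating above 8 (invented data). The surviving rows keep their original index labels.

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C", "Show D"],
    "rating": [8.5, 7.0, 9.0, 6.5],
})

high_rated = anime[anime["rating"] > 8]
print(high_rated)
```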

sort_values

Sort data frame by values in a column.
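A minimal sketch on made-up data, sorting from highest to lowest rating:

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C", "Show D"],
    "rating": [8.5, 7.0, 9.0, 6.5],
})

# ascending=False puts the largest values first
by_rating = anime.sort_values("rating", ascending=False)
print(by_rating)
```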

Groupby and count

Count number of records for each distinct value in a column.
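A sketch on invented rows, counting the names under each type:

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B", "Show C", "Show D"],
    "type": ["TV", "Movie", "TV", "OVA"],
})

# count() tallies non-null values per group
counts = anime.groupby("type")["name"].count()
print(counts)
```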

Groupby and aggregate columns in different ways

Note I added reset_index() otherwise the type column becomes the index column — I recommend doing the same in most cases.
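A sketch using agg() with a dictionary that maps each column to its own aggregation. The columns and aggregations chosen here are assumptions for illustration.

```python
import pandas as pd

anime = pd.DataFrame({
    "type": ["TV", "Movie", "TV", "OVA"],
    "rating": [8.5, 7.0, 9.0, 6.5],
    "members": [1000, 500, 2000, 100],
})

summary = (
    anime.groupby("type")
    .agg({"rating": "mean", "members": "sum"})
    .reset_index()  # turn 'type' back into an ordinary column
)
tv_row = summary[summary["type"] == "TV"].iloc[0]
print(summary)
```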

Create a pivot table

Nothing better than a pivot table for pulling a subset of data from a data frame.

Note I’ve heavily filtered the data frame so it’s quicker to build the pivot table.
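A minimal sketch on a tiny made-up ratings table, with users as rows and anime ids as columns. fill_value=0 fills the combinations that have no rating.

```python
import pandas as pd

rating = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [10, 20, 10, 30],
    "rating": [9, 7, 8, 6],
})

pivot = pd.pivot_table(
    rating,
    values="rating",
    index="user_id",
    columns="anime_id",
    aggfunc="sum",
    fill_value=0,
)
print(pivot)
```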

Set NaN cells to some value

Set cells with a NaN value to 0. In the example we create the same pivot table as before but without fill_value=0, then use fillna(0) to fill them in afterwards.
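A sketch of that two-step version on a tiny made-up ratings table:

```python
import pandas as pd

rating = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "anime_id": [10, 20, 10, 30],
    "rating": [9, 7, 8, 6],
})

# Without fill_value, missing user/anime combinations come out as NaN
pivot = pd.pivot_table(
    rating, values="rating", index="user_id", columns="anime_id", aggfunc="sum"
)
n_missing = int(pivot.isna().sum().sum())

filled = pivot.fillna(0)  # replace every NaN with 0
print(filled)
```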

Sample a data frame

I use this all the time to take a small random sample from a larger data frame. With frac=1 it returns every row in random order, with each row keeping its original index value.
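A sketch on made-up data. random_state is set only to make the result repeatable and is not required.

```python
import pandas as pd

anime = pd.DataFrame({"name": ["Show A", "Show B", "Show C", "Show D"]})

half = anime.sample(frac=0.5, random_state=0)    # random half of the rows
shuffled = anime.sample(frac=1, random_state=0)  # all rows, shuffled
print(half)
print(shuffled)
```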

Iterate over row indices

Iterate over index and rows in data frame.
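A minimal sketch using iterrows(), which yields the index label and the row as a Series on each pass (made-up data):

```python
import pandas as pd

anime = pd.DataFrame({
    "name": ["Show A", "Show B"],
    "rating": [8.5, 7.0],
})

seen = []
for idx, row in anime.iterrows():
    # row behaves like a dictionary keyed by column name
    seen.append((idx, row["name"]))
print(seen)
```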
