Most of the commands here come from pandas. A few are common data manipulation commands from outside pandas.
Package #
import pandas as pd
Data #
Read CSV (or other delimited format) #
data = pd.read_csv("<path to file>")
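As a minimal sketch of reading a non-comma delimited file, the example below parses semicolon-separated text. The CSV content and column names are made up for illustration, and `io.StringIO` stands in for a real file path so the snippet runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-delimited content, held in memory instead of a file.
csv_text = "name;score\nana;1\nben;2\n"

# sep sets the delimiter; the default is a comma.
data = pd.read_csv(io.StringIO(csv_text), sep=";")
print(data)
```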
Create dataframe from dictionary #
data = {...} # some dictionary
df = pd.DataFrame.from_dict(data)
By default, the dictionary keys become the columns. Passing orient='index'
makes the keys the row labels instead.
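A small sketch of the two orientations, using a made-up dictionary:

```python
import pandas as pd

data = {"a": [1, 2], "b": [3, 4]}  # hypothetical dictionary

# Default orientation: keys "a" and "b" become column names.
by_column = pd.DataFrame.from_dict(data)

# orient="index": keys "a" and "b" become row labels.
by_row = pd.DataFrame.from_dict(data, orient="index")

print(by_column)
print(by_row)
```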
Transformation #
Subset columns #
Specify what to keep:
data.loc[:, ["<column name>", "<column name>", ...]]
Specify what to drop:
data_sub = data.drop(labels = <list of column names>, axis = 1)
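To make the two approaches concrete, here is a short sketch on a made-up dataframe (the column names are assumptions for illustration). Both return a new dataframe and leave the original unchanged.

```python
import pandas as pd

# Hypothetical dataframe with three columns.
df = pd.DataFrame({"id": [1, 2], "name": ["ana", "ben"], "score": [0.5, 0.9]})

# Keep only the listed columns.
kept = df.loc[:, ["id", "score"]]

# Drop the listed columns; axis=1 means "columns".
dropped = df.drop(labels=["name"], axis=1)

print(kept.equals(dropped))
```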
Make categorical #
Sometimes, it’s necessary to treat a field full of numbers as a categorical variable.
data['<field>'] = pd.Categorical(data.<field>)
# or
data['<field>'] = data.<field>.astype('category')
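For instance, a numeric code such as a ZIP code is a label rather than a quantity. The sketch below uses a made-up `zip_code` column and the `astype("category")` route. Passing the column to `pd.Categorical` should give an equivalent result.

```python
import pandas as pd

# Hypothetical column of numeric codes that are really labels.
df = pd.DataFrame({"zip_code": [10001, 94105, 10001]})

# Convert the integer column to a categorical one.
df["zip_code"] = df["zip_code"].astype("category")

print(df["zip_code"].dtype)
print(list(df["zip_code"].cat.categories))
```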
Conditionally replace cell value #
data.loc[data.<field> == '<some string to match>', '<field>'] = '<new value>'
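A worked sketch with made-up values: the boolean condition selects the rows and the second argument to `.loc` selects the column to overwrite.

```python
import pandas as pd

# Hypothetical dataframe with an abbreviation to expand.
df = pd.DataFrame({"city": ["NYC", "Boston", "NYC"]})

# Rows where city == "NYC" get the new value; other rows are untouched.
df.loc[df["city"] == "NYC", "city"] = "New York"

print(df)
```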
Join dataframes #
Left join:
pd.merge(df1, df2, how = "left", left_on = "<df1 column>", right_on = "<df2 column>")
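As an illustration, the sketch below joins two made-up dataframes whose key columns have different names. A left join keeps every row of `df1`; rows with no match in `df2` get missing values in the columns that come from `df2`.

```python
import pandas as pd

# Hypothetical tables; the key is "user_id" in df1 and "uid" in df2.
df1 = pd.DataFrame({"user_id": [1, 2, 3], "name": ["ana", "ben", "cy"]})
df2 = pd.DataFrame({"uid": [1, 3], "score": [0.5, 0.9]})

joined = pd.merge(df1, df2, how="left", left_on="user_id", right_on="uid")

# user_id 2 has no match in df2, so its "uid" and "score" are NaN.
print(joined)
```

Because the key columns have different names, both `user_id` and `uid` appear in the result.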
Analysis #
Data types #
Check the data type of each column in a dataframe:
data.dtypes
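A brief sketch on a made-up dataframe: `dtypes` returns a Series indexed by column name, so a single column's type can be looked up by name.

```python
import pandas as pd

# Hypothetical dataframe with a text column and a float column.
df = pd.DataFrame({"name": ["ana", "ben"], "score": [0.5, 0.9]})

# One dtype per column.
print(df.dtypes)

# Look up the dtype of a single column.
score_type = df.dtypes["score"]
print(score_type)
```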
Preview #
Quick ways to get a sense of what the data looks like.
data.head()
data.tail()
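Both methods return five rows by default and accept a row count. A minimal sketch on a made-up ten-row dataframe:

```python
import pandas as pd

# Hypothetical dataframe with ten rows.
df = pd.DataFrame({"n": range(10)})

first_rows = df.head()   # first 5 rows by default
last_two = df.tail(2)    # last 2 rows

print(first_rows)
print(last_two)
```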
Column names #
Get column names:
data.columns
Or restructure as a list:
list(data.columns)
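To show the difference between the two forms, the sketch below uses a made-up dataframe. `data.columns` is a pandas `Index`, while `list(...)` gives a plain Python list, which is convenient for looping or for passing to other functions.

```python
import pandas as pd

# Hypothetical dataframe.
df = pd.DataFrame({"id": [1, 2], "name": ["ana", "ben"]})

cols_index = df.columns        # pandas Index object
cols_list = list(df.columns)   # plain Python list of names

print(type(cols_index).__name__)
print(cols_list)
```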