Cheatsheet: Pandas, Dataframe, (commonly used)

Written on November 13, 2021 (3 min read)

dl-ml-python

Input and Output

pd.read_pickle(path, compression=’infer’)
- ‘infer’ means detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’
df.to_pickle(path, compression=’infer’)
- Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}.
pd.read_csv(path, sep=’,’, dtype=None)
- dtype: could specify for each column with {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}
df.to_csv(path, index=False)
pd.read_excel(path, sheet_name=0, header=0, dtype=None, engine=None)
- header: int or None(if there is no header.)
- dtype: could specify for each column with {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}
- engine:
  - “xlrd” supports old-style Excel files (.xls).
  - “openpyxl” supports newer Excel file formats.
  - “odf” supports OpenDocument file formats (.odf, .ods, .odt).
  - “pyxlsb” supports Binary Excel files.
dataframe.to_excel(path, sheet_name=’Sheet1’)
class pd.ExcelWriter(path, engine=None, date_format=None, datetime_format=None, mode=’w’)
- mode{‘w’, ‘a’}, default ‘w’
- The code below shows the example to use pd.ExcelWriter, you could go to xlsxWriter documentation to see more on how to use class workbook and class worksheet.

with pd.ExcelWriter(path, engine=’xlsxwriter’) as ew # Add a header format. workbook = ew.book header_format = workbook.add_format({ ‘bold’: True, ‘text_wrap’: True, ‘valign’: ‘top’, ‘fg_color’: ‘#D7E4BC’, ‘border’: 1}) # save dataframe to excel df.to_excel(ew, sheet_name=’Sheet1’, header=true, index=False) # apply fmt worksheet = ew.sheets[‘Sheet1’] worksheet.set_column(0, 0, width=15, cell_format=header_format)

General Function

Create dataframe

pd.melt(df, id_vars=None, value_vars=None, var_name=None, value_name=’value’)
- Gather Column into rows. (Inverse Operation of pivot)
- id_vars: tuple, list, or ndarray. Those columns that keep there position
- value_vars: tuple, list, or ndarray Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars
- var_name: name to use for the ‘variable’ column
- value_name: Name to use for the ‘value’ column

**pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None)**
- merge 2 dataframe

pd.concat(objs, axis=0, join=’outer’, ignore_index=False)
- axis{0/’index’, 1/’columns’}, default 0
- The axis to concatenate along.join{‘inner’, ‘outer’}, default ‘outer’

pd.get_dummies(data, prefix=None, prefix_sep=’‘, _dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)
- Convert categorical variable into dummy/indicator variables
- data:dataarray-like, Series, or DataFrame
pd.to_datetime(arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True)
- arg: int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
- errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
  - If ‘raise’, then invalid parsing will raise an exception.
  - If ‘coerce’, then invalid parsing will be set as NaT.
  - If ‘ignore’, then invalid parsing will return the input.
- Format: The strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds.
pd.eval(expr, parser=’pandas’, target=None)
- Evaluate a Python expression as a string using various backends. (Not frequently used personally)
- Exmaple: Add a new column using pd.eval:

pd.eval(“double_age = df.age * 2”, target=df)

Dataframe

Sorting, reindexing, renaming

df.sort_values(by, axis=0, ascending=True, inplace=False, kind=’quicksort’)
- kind{‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’ (For DataFrames, this option is only applied when sorting on a single column or label.)
df.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors=’ignore’)
df.reset_index(level=None, _drop=_True)
- drop[bool]: Do not try to insert index into dataframe columns. This resets the index to the default integer index.

Subset Observation: df.length, df.drop_duplicates(subset=None), df_sample, df_nsmallest, df.head(n), df.tail(n), df.filter(), df.query()

df.filter(items=None, like=None, regex=None, axis=None)

Select columns whose name matches regular expression regex.

Summarize Data

df.describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
- Generate descriptive statistics.
Series.nunique(dropna=True)
- Return number of unique elements in the object.
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
- Return a Series containing counts of unique values.

Handling Missing Data & Sanity Check Empty Data

DataFrame.dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False)
- Remove missing value rows(axis=0) or columns(axis=1)
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
print(df.isnull().sum().sum())

Group Data

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
- by: mapping, function, label, or list of labels
- dropna [bool], default True

Window Function

df.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method=’single’)
- window[int]: Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
- min_periods[int]: Minimum number of observations in window required to have a value
- win_type: all possible win type can be found here
df.expanding(min_periods=1, center=None, axis=0, method=’single’)
- Provide expanding transformations.

Explode

df.explode(column, ignore_index=False)
- Transform each element of a list-like to a row, replicating index values.

Reference

Dataframe cheatsheet

Liangliang

Cheatsheet: Pandas, Dataframe, (commonly used)

Share this on Twitter

Leave a comment