Cheatsheet: Pandas, Dataframe, (commonly used)

Written on November 13, 2021 (3 min read)

Input and Output

  • pd.read_pickle(path, compression=’infer’)
    • ‘infer’ means detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’
  • df.to_pickle(path, compression=’infer’)
    • Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. 
  • pd.read_csv(path, sep=’,’, dtype=None)
    • dtype: could specify for each column with {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}
  • df.to_csv(path, index=False)
  • pd.read_excel(path, sheet_name=0header=0dtype=Noneengine=None)
    • header: int or None(if there is no header.)
    • dtype: could specify for each column with {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}
    • engine:
      • “xlrd” supports old-style Excel files (.xls).
      • “openpyxl” supports newer Excel file formats.
      • “odf” supports OpenDocument file formats (.odf, .ods, .odt).
      • “pyxlsb” supports Binary Excel files.
  • dataframe.to_excel(pathsheet_name=’Sheet1’)
  • class pd.ExcelWriter(path, engine=None, date_format=None, datetime_format=None, mode=’w’)
    • mode{‘w’, ‘a’}, default ‘w’
    • The code below shows the example to use pd.ExcelWriter, you could go to xlsxWriter documentation to see more on how to use class workbook and class worksheet.

with pd.ExcelWriter(path, engine=’xlsxwriter’) as ew # Add a header format. workbook = ew.book header_format = workbook.add_format({ ‘bold’: True, ‘text_wrap’: True, ‘valign’: ‘top’, ‘fg_color’: ‘#D7E4BC’, ‘border’: 1}) # save dataframe to excel df.to_excel(ew, sheet_name=’Sheet1’, header=true, index=False) # apply fmt worksheet = ew.sheets[‘Sheet1’] worksheet.set_column(0, 0, width=15, cell_format=header_format)

General Function

  • Create dataframe

  • pd.melt(df, id_vars=None, value_vars=None, var_name=None, value_name=’value’)
    • Gather Column into rows. (Inverse Operation of pivot)
    • id_vars: tuple, list, or ndarray. Those columns that keep there position
    • value_vars: tuple, list, or ndarray Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars
    • var_name: name to use for the ‘variable’ column
    • value_name: Name to use for the ‘value’ column

  • **pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None)**
    • merge 2 dataframe

  • pd.concat(objsaxis=0join=’outer’ignore_index=False)
    • axis{0/’index’, 1/’columns’}, default 0
    • The axis to concatenate along.join{‘inner’, ‘outer’}, default ‘outer’

  • pd.get_dummies(dataprefix=Noneprefix_sep=’, _dummy_na=Falsecolumns=Nonesparse=Falsedrop_first=Falsedtype=None)
    • Convert categorical variable into dummy/indicator variables
    • data:dataarray-like, Series, or DataFrame
  • pd.to_datetime(arg, errors=’raise’, dayfirst=False, yearfirst=False, utc=None, format=None, exact=True)
    • arg: int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like
    • errors{‘ignore’, ‘raise’, ‘coerce’}, default ‘raise’
      • If ‘raise’, then invalid parsing will raise an exception.
      • If ‘coerce’, then invalid parsing will be set as NaT.
      • If ‘ignore’, then invalid parsing will return the input.
    • Format: The strftime to parse time, eg “%d/%m/%Y”, note that “%f” will parse all the way up to nanoseconds.
  • pd.eval(exprparser=’pandas’target=None)
    • Evaluate a Python expression as a string using various backends. (Not frequently used personally)
    • Exmaple: Add a new column using pd.eval:

pd.eval(“double_age = df.age * 2”, target=df)

Dataframe

Sorting, reindexing, renaming

  • df.sort_values(byaxis=0ascending=Trueinplace=Falsekind=’quicksort’)
    • kind{‘quicksort’, ‘mergesort’, ‘heapsort’, ‘stable’}, default ‘quicksort’ (For DataFrames, this option is only applied when sorting on a single column or label.)
  • df.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors=’ignore’)
  • df.reset_index(level=None, _drop=_True)
    • drop[bool]: Do not try to insert index into dataframe columns. This resets the index to the default integer index.

Subset Observation: df.length, df.drop_duplicates(subset=None), df_sample, df_nsmallest, df.head(n), df.tail(n), df.filter(), df.query()

  • df.filter(items=Nonelike=Noneregex=Noneaxis=None)

Select columns whose name matches regular expression regex.

Summarize Data

  • df.describe(percentiles=Noneinclude=Noneexclude=Nonedatetime_is_numeric=False)
    • Generate descriptive statistics.
  • Series.nunique(dropna=True)
    • Return number of unique elements in the object.
  • Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
    • Return a Series containing counts of unique values.

Handling Missing Data & Sanity Check Empty Data

  • DataFrame.dropna(axis=0, how=’any’, thresh=None, subset=None, inplace=False)
    • Remove missing value rows(axis=0) or columns(axis=1)
  • DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
  • print(df.isnull().sum().sum())

Group Data

  • DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=NoDefault.no_default, observed=False, dropna=True)
    • by: mapping, function, label, or list of labels
    • dropna [bool], default True

Window Function

  • df.rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method=’single’)

    • window[int]: Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
    • min_periods[int]: Minimum number of observations in window required to have a value 

    • win_type: all possible win type can be found here
  • df.expanding(min_periods=1center=Noneaxis=0method=’single’)

    • Provide expanding transformations.

Explode

  • df.explode(columnignore_index=False)
    • Transform each element of a list-like to a row, replicating index values.

Reference

  1. Dataframe cheatsheet

Share this on Twitter
Leave a comment