Let’s say you have a dataset stored as a Pandas dataframe, df, with a numerical column and a categorical column, and you want to determine whether the categories are statistically different from one another.
A Mann-Whitney U test might be appropriate, especially if the fundamental assumptions of the more conventional t-test are not met (e.g., the variances across the groups are similar and the distributions are mostly normal).
A minimum viable code snippet to perform this test would look like:
import pandas as pd
import numpy as np
import scipy.stats as stats
group1_array = df.loc[df['group'] == "<group_1>"]["<numerical_values>"].to_numpy()
group2_array = df.loc[df['group'] == "<group_2>"]["<numerical_values>"].to_numpy()
mwtest_results = stats.mannwhitneyu(x=group1_array, y=group2_array, alternative='two-sided')
This returns a test statistic and a p-value.
If there are NA values, you may want to toss in a .dropna() before the .to_numpy() call in the array preparation.
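To make the steps above concrete, here is a minimal end-to-end sketch on made-up data. The dataframe contents, the column names "group" and "value", and the helper group_values are all hypothetical and only there for illustration; swap in your own column names and category labels.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: column names and values are invented for illustration.
df = pd.DataFrame({
    "group": ["a"] * 6 + ["b"] * 6,
    "value": [1.0, 2.0, 3.0, 4.0, np.nan, 5.0,
              10.0, 11.0, 12.0, np.nan, 13.0, 14.0],
})

def group_values(frame, label):
    """Return the non-missing values of one category as a NumPy array."""
    # .dropna() is a pandas Series method, so it goes before .to_numpy().
    return frame.loc[frame["group"] == label, "value"].dropna().to_numpy()

group_a = group_values(df, "a")
group_b = group_values(df, "b")

# The result object exposes the U statistic and the p-value as attributes.
result = stats.mannwhitneyu(x=group_a, y=group_b, alternative="two-sided")
print(f"U = {result.statistic}, p = {result.pvalue:.4f}")
```

A small p-value suggests the two groups' values tend to differ, though the threshold you compare it against (0.05 is a common choice) is up to you. The two groups do not need to be the same size.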