Architecture

Cleaning tools

MyForestplot prepares cleaning tools for preparing dataframe used for forestplot. These tools are mainly designed to use results of statsmodels results.

[1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf

import myforestplot as mfp

%load_ext autoreload
%autoreload 2

%load_ext watermark
%watermark -n -u -v -iv -w -p graphviz
Last updated: Thu Sep 15 2022

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 8.0.1

graphviz: not installed

myforestplot: 0.2.1
numpy       : 1.21.5
pandas      : 1.4.1
statsmodels : 0.13.2
matplotlib  : 3.5.1

Watermark: 2.3.1

[2]:
data = pd.read_csv("titanic.csv")
data = data[["survived", "pclass", "sex", "age", "embark_town"]]
data = data.dropna()
res = smf.logit("survived ~ sex + age + embark_town", data=data).fit()
Optimization terminated successfully.
         Current function value: 0.509889
         Iterations 6

statsmodels_pretty_result_dataframe

This function converts statsmodelsresult into dataframe shape, adding number of observations for each category.

[3]:
order = ["age", "sex", "embark_town"]
cont_cols = ["age"]
item_order = {"embark_town": ['Southampton', 'Cherbourg', 'Queenstown']}
df = mfp.statsmodels_pretty_result_dataframe(data, res,
                                             order=order,
                                             cont_cols=cont_cols,
                                             item_order=item_order,
                                             fml=".3f",
                                             )
df
[3]:
category item 0 1 risk nobs risk_pretty
3 age age 0.979300 1.004771 0.991954 NaN 0.99 (0.98, 1.00)
0 sex male 0.057848 0.122213 0.084082 453.0 0.08 (0.06, 0.12)
4 sex female NaN NaN NaN 259.0 Ref.
2 embark_town Southampton 0.229654 0.581167 0.365332 554.0 0.37 (0.23, 0.58)
5 embark_town Cherbourg NaN NaN NaN 130.0 Ref.
1 embark_town Queenstown 0.057027 0.464428 0.162742 28.0 0.16 (0.06, 0.46)

“statsmodels_pretty_result_dataframe” is made up of 5 steps. 1. Convert statasmodels results into dataframe.

[4]:
df_res = mfp.statsmodels_fitting_result_dataframe(res, alpha=0.05, accessor=np.exp)
df_res
[4]:
category item 0 1 risk
1 sex male 0.057848 0.122213 0.084082
2 embark_town Queenstown 0.057027 0.464428 0.162742
3 embark_town Southampton 0.229654 0.581167 0.365332
4 age age 0.979300 1.004771 0.991954

If you want to obtain raw results, set accessor as lambda x: x

[5]:
df_res2 = mfp.statsmodels_fitting_result_dataframe(res, alpha=0.05, accessor=lambda x: x)
df_res2
[5]:
category item 0 1 risk
1 sex male -2.849937 -2.101986 -2.475962
2 embark_town Queenstown -2.864233 -0.766950 -1.815592
3 embark_town Southampton -1.471180 -0.542717 -1.006949
4 age age -0.020917 0.004760 -0.008079
  1. Obtain number of observations for each category variable.

[6]:
cate_cols = [c for c in order if not c in cont_cols]
df_nobs = mfp.count_category_frequency(data, cate_cols)
df_nobs
[6]:
category item nobs
0 sex male 453
1 sex female 259
2 embark_town Southampton 554
3 embark_town Cherbourg 130
4 embark_town Queenstown 28
  1. Merge statsmodels resutl and dataframe of number of observations.

[7]:
df_sum = pd.merge(df_res, df_nobs, on=["category", "item"], validate="1:1", how="outer")
df_sum
[7]:
category item 0 1 risk nobs
0 sex male 0.057848 0.122213 0.084082 453.0
1 embark_town Queenstown 0.057027 0.464428 0.162742 28.0
2 embark_town Southampton 0.229654 0.581167 0.365332 554.0
3 age age 0.979300 1.004771 0.991954 NaN
4 sex female NaN NaN NaN 259.0
5 embark_town Cherbourg NaN NaN NaN 130.0
  1. Sort items.

[8]:
df_sum = mfp.sort_category_item(df_sum, order=order, item_order=item_order)
df_sum
[8]:
category item 0 1 risk nobs
3 age age 0.979300 1.004771 0.991954 NaN
0 sex male 0.057848 0.122213 0.084082 453.0
4 sex female NaN NaN NaN 259.0
2 embark_town Southampton 0.229654 0.581167 0.365332 554.0
5 embark_town Cherbourg NaN NaN NaN 130.0
1 embark_town Queenstown 0.057027 0.464428 0.162742 28.0
  1. Add pretty styles of risk results.

[9]:
df_sum["risk_pretty"] = mfp.add_pretty_risk_column(df_sum,
                                                   risk="risk",
                                                   lower=0,
                                                   upper=1,
                                                   fml=".2f"
                                                   )
df_sum
[9]:
category item 0 1 risk nobs risk_pretty
3 age age 0.979300 1.004771 0.991954 NaN 0.99 (0.98, 1.00)
0 sex male 0.057848 0.122213 0.084082 453.0 0.08 (0.06, 0.12)
4 sex female NaN NaN NaN 259.0 Ref.
2 embark_town Southampton 0.229654 0.581167 0.365332 554.0 0.37 (0.23, 0.58)
5 embark_town Cherbourg NaN NaN NaN 130.0 Ref.
1 embark_town Queenstown 0.057027 0.464428 0.162742 28.0 0.16 (0.06, 0.46)

SimpleForestPlot

The following is an illustration of how myforestplot works.

[10]:
df = df_sum.copy()
[11]:
plt.rcParams["font.size"] = 8
fp = mfp.SimpleForestPlot(ratio=(8,3), dpi=150, figsize=(7,3), df=df)
fp.errorbar(errorbar_kwds=None)
fp.ax2.set_xlabel("OR")
fp.ax2.axvline(x=1, ymin=0, ymax=1.0, color="black", alpha=0.5)
fp.embed_strings("risk_pretty", 0.5, header="OR (95% CI)")

plt.show()
../_images/notebooks_3_architecture_20_0.png

BaseForestplot uses Gridspect to create 2 axes. One is for text, and one is for errorbar plot. To draw original axis ticks and labels for two axes.

[12]:
plt.rcParams["font.size"] = 8
fp = mfp.SimpleForestPlot(ratio=(8,3), dpi=150, figsize=(7,3), df=df,
                        yticks_show=True,
                        yticklabels_show=True,
                        xticks_show=True,
                        text_axis_off=False)
fp.errorbar(errorbar_kwds=None)
fp.ax2.set_xlabel("OR")
fp.ax2.axvline(x=1, ymin=0, ymax=1.0, color="black", alpha=0.5)
fp.embed_strings("risk_pretty", 0.5, header="OR (95% CI)")

plt.show()
../_images/notebooks_3_architecture_22_0.png
Originally, two axes shares yaxis, ranging from minus (number of plots - 1) to 0. For the text part axis, x axis ranges from 0 to one, and embed_strings just places text in this field.
So we have to arrange x position of texts manually to draw beautiful figures. However, this packages just provide basic functionaly, meaning much customizability to draw forestplot based on your preferences.

Also see Gallery section to know what kind of designs are available.

[ ]: