funModeling quick-start

funModeling

This package contains a set of functions related to exploratory data analysis, data preparation, and model performance. It is used by people coming from business, research, and teaching (professors and students).

funModeling is closely tied to the Data Science Live Book -Open Source- (2017): most of its functionality is used to explain the topics the book covers.

Data Science Live Book

Blog posts based on `funModeling`:

Opening the black-box

Some functions have in-line comments so that users can open the black box, learn how each function was developed, and tune or improve it.

All the functions are well documented, with every parameter explained through many short examples. The R documentation can be accessed with: help("name_of_the_function").

About this quick-start

This quick-start is focused only on the functions. All explanations around them, and the how and when to use them, can be accessed by following the “Read more here.” links below each section, which redirect you to the book.

Most of the funModeling functions are listed below, divided by category.

Exploratory data analysis

`status`: Dataset health status (2nd version)

Similar to df_status, but it returns all percentages in the 0 to 1 range (not 0 to 100).

library(funModeling)

status(heart_disease)

##                                      variable q_zeros   p_zeros q_na       p_na
## age                                       age       0 0.0000000    0 0.00000000
## gender                                 gender       0 0.0000000    0 0.00000000
## chest_pain                         chest_pain       0 0.0000000    0 0.00000000
## resting_blood_pressure resting_blood_pressure       0 0.0000000    0 0.00000000
## serum_cholestoral           serum_cholestoral       0 0.0000000    0 0.00000000
## fasting_blood_sugar       fasting_blood_sugar     258 0.8514851    0 0.00000000
## resting_electro               resting_electro     151 0.4983498    0 0.00000000
## max_heart_rate                 max_heart_rate       0 0.0000000    0 0.00000000
## exer_angina                       exer_angina     204 0.6732673    0 0.00000000
## oldpeak                               oldpeak      99 0.3267327    0 0.00000000
## slope                                   slope       0 0.0000000    0 0.00000000
## num_vessels_flour           num_vessels_flour     176 0.5808581    4 0.01320132
## thal                                     thal       0 0.0000000    2 0.00660066
## heart_disease_severity heart_disease_severity     164 0.5412541    0 0.00000000
## exter_angina                     exter_angina     204 0.6732673    0 0.00000000
## has_heart_disease           has_heart_disease       0 0.0000000    0 0.00000000
##                        q_inf p_inf    type unique
## age                        0     0 integer     41
## gender                     0     0  factor      2
## chest_pain                 0     0  factor      4
## resting_blood_pressure     0     0 integer     50
## serum_cholestoral          0     0 integer    152
## fasting_blood_sugar        0     0  factor      2
## resting_electro            0     0  factor      3
## max_heart_rate             0     0 integer     91
## exer_angina                0     0 integer      2
## oldpeak                    0     0 numeric     40
## slope                      0     0 integer      3
## num_vessels_flour          0     0 integer      4
## thal                       0     0  factor      3
## heart_disease_severity     0     0 integer      5
## exter_angina               0     0  factor      2
## has_heart_disease          0     0  factor      2

Note: df_status will be deprecated; please use status instead.
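
To make the column meanings concrete, here is a minimal base-R sketch that recomputes the same kind of metrics (q_zeros, p_zeros, q_na, p_na, type, unique) on a small toy data frame. It is not the package's implementation; the `toy` data, the `toy_status` name, and the 0.4 threshold are assumptions made only for illustration.

```r
# Toy data (hypothetical values, for illustration only)
toy <- data.frame(
  a = c(0, 1, 2, NA),
  b = c(0, 0, 5, 7),
  c = c("x", "y", NA, "x"),
  stringsAsFactors = FALSE
)

# Quantity of zeros and of NA values per column
q_zeros <- sapply(toy, function(x) sum(x == 0, na.rm = TRUE))
q_na    <- sapply(toy, function(x) sum(is.na(x)))

# Proportions are expressed in the 0 to 1 range, as status does
toy_status <- data.frame(
  variable = names(toy),
  q_zeros  = q_zeros,
  p_zeros  = q_zeros / nrow(toy),
  q_na     = q_na,
  p_na     = q_na / nrow(toy),
  type     = sapply(toy, class),
  unique   = sapply(toy, function(x) length(unique(na.omit(x)))),
  stringsAsFactors = FALSE
)

# Example filter: variables where more than 40% of the values are zero
high_zero_vars <- toy_status$variable[toy_status$p_zeros > 0.4]

print(toy_status)
print(high_zero_vars)
```

Because the proportions are already in the 0 to 1 range, thresholds like the one above can be written directly as fractions. The same kind of filter should apply to the data frame returned by status(heart_disease).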

`data_integrity`: Dataset health status (2nd version)

A handy function that returns vectors of variable names, making it quick to filter variables with NA values, categorical variables (factor / character), numerical variables, and other types (boolean, date, posix).

It also returns a vector of variables which have high cardinality.

It returns an ‘integrity’ object containing ‘status_now’ (which comes from the status function) and a ‘results’ list, where elements such as vars_cat, vars_num, vars_num_with_NA, etc. can be found. Explore the object for more.

library(funModeling)

di <- data_integrity(heart_disease)

# returns a summary
summary(di)

## 
## ◌ {Numerical with NA} num_vessels_flour
## ◌ {Categorical with NA} thal

# print all the metadata information
print(di)

## $vars_num_with_NA
##                            variable q_na       p_na
## num_vessels_flour num_vessels_flour    4 0.01320132
## 
## $vars_cat_with_NA
##      variable q_na       p_na
## thal     thal    2 0.00660066
## 
## $vars_cat_high_card
## [1] variable unique  
## <0 rows> (or 0-length row.names)
## 
## $MAX_UNIQUE
## [1] 35
## 
## $vars_one_value
## character(0)
## 
## $vars_cat
## [1] "gender"              "chest_pain"          "fasting_blood_sugar"
## [4] "resting_electro"     "thal"                "exter_angina"       
## [7] "has_heart_disease"  
## 
## $vars_num
## [1] "age"                    "resting_blood_pressure" "serum_cholestoral"     
## [4] "max_heart_rate"         "exer_angina"            "oldpeak"               
## [7] "slope"                  "num_vessels_flour"      "heart_disease_severity"
## 
## $vars_char
## character(0)
## 
## $vars_factor
## [1] "gender"              "chest_pain"          "fasting_blood_sugar"
## [4] "resting_electro"     "thal"                "exter_angina"       
## [7] "has_heart_disease"  
## 
## $vars_other
## character(0)
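
The vectors in the ‘results’ list can be used directly to subset a data frame. The following is a short sketch, assuming the elements are reached as `di$results$...` (as the description of the ‘integrity’ object suggests); the names `num_vars`, `heart_num`, and `vars_with_na` are introduced here only for illustration.

```r
library(funModeling)

di <- data_integrity(heart_disease)

# Names of the numerical variables, taken from the 'results' list
num_vars <- di$results$vars_num

# Keep only the numerical columns
heart_num <- heart_disease[, num_vars]

# Collect every variable that has at least one NA, numerical or categorical
vars_with_na <- c(
  as.character(di$results$vars_num_with_NA$variable),
  as.character(di$results$vars_cat_with_NA$variable)
)

print(vars_with_na)
```

With the heart_disease output shown above, `vars_with_na` should contain num_vessels_flour and thal.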

`plot_num`: Plotting distributions for numerical variables

Plots only numeric variables.

plot_num(heart_disease)

Notes:

bins sets the number of bins (10 by default).
path_out indicates the output directory; if it has a value, the plot is exported as a jpeg. To save in the current directory, path_out must be a dot: “.”
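
The two parameters in the notes above can be tried as follows. This is a sketch, not a prescribed workflow: the choice of 20 bins and of a temporary directory as the export target are assumptions for the example, and the name of the exported file is decided by the package, so the code simply lists whatever jpeg files appear.

```r
library(funModeling)

# Finer histograms: 20 bins instead of the default 10
plot_num(heart_disease, bins = 20)

# Export the plot to a directory (a temporary one, chosen only for this sketch)
out_dir <- tempdir()
plot_num(heart_disease, path_out = out_dir)

# List the jpeg files written to that directory
exported <- list.files(out_dir, pattern = "\\.jpe?g$", ignore.case = TRUE)
print(exported)
```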

Blog posts based on funModeling:

Opening the black-box

About this quick-start

Exploratory data analysis

status: Dataset health status (2nd version)

data_integrity: Dataset health status (2nd version)

plot_num: Plotting distributions for numerical variables

profiling_num: Calculating several statistics for numerical variables

freq: Getting frequency distributions for categoric variables

Correlation

correlation_table: Calculates R statistic

var_rank_info: Correlation based on information theory

cross_plot: Distribution plot between input and target variable

plotar: Boxplot and density histogram between input and target variables

categ_analysis: Quantitative analysis for binary outcome

Data preparation

Data discretization

discretize_get_bins + discretize_df: Convert numeric variables to categoric

equal_freq: Convert numeric variable to categoric

discretize_rgr: Variable discretization based on gain ratio maximization

range01: Scales variable into the 0 to 1 range

Outliers data preparation

hampel_outlier and tukey_outlier: Gets outliers threshold

prep_outliers: Prepare outliers in a data frame

Predictive model performance

gain_lift: Gain and lift performance curve

coord_plot: Coordinate plot (clustering models)
