Measures the association between a categorical variable and a set of continuous and/or categorical variables

catdesc(y, x, weights = NULL, 
na.rm.cat = FALSE, na.value.cat = "NA", na.rm.cont = FALSE,
measure = "phi", limit = NULL, correlation = "kendall", robust = TRUE, 
nperm = NULL, distrib = "asympt", digits = 2)

Arguments

y

the categorical variable to describe (must be a factor)

x

a data frame with continuous and/or categorical variables

weights

numeric vector of weights. If NULL (default), uniform weights (i.e. all equal to 1) are used.

na.rm.cat

logical, indicating whether NA values in the categorical variables should be silently removed before the computation proceeds. If FALSE (default), an additional level is added to the categorical variables (see na.value.cat argument).

na.value.cat

character. Name of the level for NA category. Default is "NA". Only used if na.rm.cat = FALSE.

na.rm.cont

logical, indicating whether NA values in the continuous variables should be silently removed before the computation proceeds. Default is FALSE.

measure

character. The measure of local association between categories of categorical variables. Can be "phi" for phi coefficient (default), "or" for odds ratios, "std.residuals" for standardized (i.e. Pearson) residuals, "adj.residuals" for adjusted standardized residuals or "pem" for local percentages of maximum deviation from independence.

limit

numeric. For the relationship between y and a categorical variable, only associations greater than or equal to limit are displayed. If NULL (default), all associations are displayed.

correlation

character. The type of correlation measure to use between two continuous variables: "pearson", "spearman" or "kendall" (default).

robust

logical. If TRUE (default), the median and the median absolute deviation (mad) are used instead of the mean and the standard deviation.

nperm

numeric. Number of permutations for the permutation test of independence. If NULL (default), no permutation test is performed.

distrib

the null distribution of the permutation test of independence can be approximated by its asymptotic distribution ("asympt", default) or via Monte Carlo resampling ("approx").

digits

numeric. Number of digits for mean, median, standard deviation and mad. Default is 2.
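
As an illustration of the default measure ("phi"), the sketch below recomputes the phi coefficient for Country.USA and the "No" level of ArtHouse by hand. The 2x2 counts are derived from the freq columns of the example output on this page (257 and 40 USA films, out of 514 "No" and 486 "Yes" films). It assumes the usual unweighted 2x2 definition of phi and is not the package's internal code; the odds ratio is the standard cross-product ratio and may not be formatted the way measure = "or" reports it.

```r
# 2x2 counts derived from the example output: Country.USA crossed with ArtHouse
# (the matrix is filled column by column: first the "No" column, then "Yes")
tab <- matrix(c(257, 257, 40, 446), nrow = 2,
              dimnames = list(Country = c("USA", "not USA"),
                              ArtHouse = c("No", "Yes")))

n11 <- tab[1, 1]; n12 <- tab[1, 2]
n21 <- tab[2, 1]; n22 <- tab[2, 2]

# Phi coefficient: cross-product difference scaled by the product of the margins
phi <- (n11 * n22 - n12 * n21) / sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# Standard odds ratio for the same table
odds_ratio <- (n11 * n22) / (n12 * n21)

round(phi, 3)  # 0.457, the value shown for Country.USA in $bylevel$No$categories
```

Swapping the two columns of the table changes the sign of phi, which is why the same category appears with opposite signs under the "No" and "Yes" levels.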

Value

A list with the following items:

variables

associations between y and the variables in x

bylevel

a list with one element for each level of y

Each element in bylevel has the following items:

categories

a data frame with categorical variables from x and local associations

continuous.var

a data frame with continuous variables from x and associations measured by correlation coefficients
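
For categorical variables, the variables item reports Cramér's V. As a rough check, the sketch below recomputes it for Country from the freq columns of the example output on this page, using only base R. It assumes unweighted counts and the usual definition of V (square root of chi-squared divided by n times the smaller table dimension minus one); it is not the package's own implementation.

```r
# Country by ArtHouse counts, taken from the freq columns of the example output
tab <- matrix(c(257, 39, 6, 212,    # ArtHouse = "No"
                 40, 33, 20, 393),  # ArtHouse = "Yes"
              ncol = 2,
              dimnames = list(Country = c("USA", "Europe", "Other", "France"),
                              ArtHouse = c("No", "Yes")))

# Pearson chi-squared statistic, without continuity correction
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramer's V
v <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))

round(v, 3)  # 0.469, the association reported for Country in $variables
```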

Note

If nperm is not NULL, permutation tests of independence are computed and the p-values from these tests are provided.
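
To give an idea of what a Monte Carlo permutation test of independence does, here is a minimal base R sketch on simulated data. The test statistic (absolute difference in group means), the simulated variables and the helper name stat are assumptions made for illustration; the page does not specify which statistic or implementation catdesc relies on.

```r
set.seed(1)

# Simulated data: a two-level factor and a continuous variable (hypothetical)
y <- factor(rep(c("No", "Yes"), each = 20))
x <- c(rnorm(20, mean = 10), rnorm(20, mean = 11))

# Hypothetical test statistic: absolute difference between the two group means
stat <- function(x, y) abs(diff(tapply(x, y, mean)))

observed <- stat(x, y)

# Shuffle the labels to break any association, and recompute the statistic
nperm <- 999
permuted <- replicate(nperm, stat(x, sample(y)))

# Monte Carlo p-value, counting the observed statistic as one of the permutations
p_value <- (sum(permuted >= observed) + 1) / (nperm + 1)
p_value
```

A larger number of permutations gives a finer-grained p-value at the cost of computation time.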

References

Rakotomalala R., 'Comprendre la taille d'effet (effect size)', [http://eric.univ-lyon2.fr/~ricco/cours/slides/effect_size.pdf]

Author

Nicolas Robette

Examples

data(Movies)
catdesc(Movies$ArtHouse, Movies[,c("Budget","Genre","Country")])
#> $variables
#>   variable  measure association
#> 1    Genre Cramer V       0.554
#> 2  Country Cramer V       0.469
#> 3   Budget     Eta2       0.181
#> 
#> $bylevel
#> $bylevel$No
#> $bylevel$No$categories
#>           categories freq pct.y.in.x pct.x.in.y overall.pct.x    phi
#> 1        Country.USA  257       86.5       50.0          29.7  0.457
#> 2       Genre.Comedy  161       72.5       31.3          22.2  0.226
#> 3       Genre.Action  123       74.5       23.9          16.5  0.206
#> 4        Genre.SciFi   44       89.8        8.6           4.9  0.174
#> 5       Genre.Horror   25      100.0        4.9           2.5  0.156
#> 6    Genre.Animation   38       82.6        7.4           4.6  0.137
#> 7        Genre.Other   15       57.7        2.9           2.6  0.021
#> 8     Country.Europe   39       54.2        7.6           7.2  0.015
#> 9      Country.Other    6       23.1        1.2           2.6 -0.093
#> 10     Genre.ComDram   50       33.6        9.7          14.9 -0.149
#> 11 Genre.Documentary    9       11.7        1.8           7.7 -0.229
#> 12       Genre.Drama   49       20.3        9.5          24.1 -0.350
#> 13    Country.France  212       35.0       41.2          60.5 -0.405
#> 
#> $bylevel$No$continuous.var
#>   variables median.in.category overall.median mad.in.category overall.mad
#> 1    Budget           17218500        6127500        12309532     5156921
#>   correlation
#> 1       0.426
#> 
#> 
#> $bylevel$Yes
#> $bylevel$Yes$categories
#>           categories freq pct.y.in.x pct.x.in.y overall.pct.x    phi
#> 1     Country.France  393       65.0       80.9          60.5  0.405
#> 2        Genre.Drama  192       79.7       39.5          24.1  0.350
#> 3  Genre.Documentary   68       88.3       14.0           7.7  0.229
#> 4      Genre.ComDram   99       66.4       20.4          14.9  0.149
#> 5      Country.Other   20       76.9        4.1           2.6  0.093
#> 6     Country.Europe   33       45.8        6.8           7.2 -0.015
#> 7        Genre.Other   11       42.3        2.3           2.6 -0.021
#> 8    Genre.Animation    8       17.4        1.6           4.6 -0.137
#> 9       Genre.Horror    0        0.0        0.0           2.5 -0.156
#> 10       Genre.SciFi    5       10.2        1.0           4.9 -0.174
#> 11      Genre.Action   42       25.5        8.6          16.5 -0.206
#> 12      Genre.Comedy   61       27.5       12.6          22.2 -0.226
#> 13       Country.USA   40       13.5        8.2          29.7 -0.457
#> 
#> $bylevel$Yes$continuous.var
#>   variables median.in.category overall.median mad.in.category overall.mad
#> 1    Budget            2281690        6127500         1629608     5156921
#>   correlation
#> 1      -0.426
#> 
#> 
#>
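
In the output above, the correlation for Budget is 0.426 for level No and -0.426 for level Yes. This is what one would expect if, for each level of y, the correlation is computed between a 0/1 indicator of that level and the continuous variable; the page does not state this explicitly, so the sketch below is an assumption, shown on made-up data with base R only.

```r
# Made-up data: a two-level factor and a continuous variable
y <- factor(c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"))
budget <- c(20, 15, 3, 12, 5, 2, 30, 8)

# Kendall correlation between each level's 0/1 indicator and the variable
tau_no  <- cor(as.numeric(y == "No"),  budget, method = "kendall")
tau_yes <- cor(as.numeric(y == "Yes"), budget, method = "kendall")

# With two levels the indicators are complementary, so the signs are opposite
all.equal(tau_no, -tau_yes)
```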