Data dummification is also known as one-hot encoding or feature binarization. It turns each category into a distinct column with binary (numeric) values.
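As a rough illustration of the idea, the base R sketch below one-hot encodes a single factor. The helper `one_hot` and its `prefix` argument are hypothetical names introduced only for this example. It is not how dummify is implemented, merely a minimal version of the same transformation.

```r
# Hypothetical helper (not part of any package): one-hot encode one factor.
one_hot <- function(x, prefix) {
  x <- as.factor(x)
  # Build one integer 0/1 column per level of the factor
  out <- lapply(levels(x), function(lv) as.integer(x == lv))
  # Name each column as <prefix>_<level>
  names(out) <- paste(prefix, levels(x), sep = "_")
  as.data.frame(out)
}

encoded <- one_hot(iris$Species, "Species")
str(encoded)
```

Each row of `encoded` contains exactly one 1, in the column that matches the original category.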
Value
A dataset in which the discrete features are dummified and all other original features are preserved. Column order might differ from the input.
Details
Continuous features will be ignored if added in select.
Features in select will be ignored if their number of categories exceeds maxcat.
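The two rules above can be sketched in base R as a filter over column names. The helper `eligible_features` is hypothetical, and treating factor and character columns as "discrete" is an assumption made for this sketch. The actual detection logic in dummify may differ.

```r
# Hypothetical helper: which columns would be candidates for encoding?
eligible_features <- function(data, maxcat, select = NULL) {
  # Assumption: factor and character columns count as discrete
  is_discrete <- vapply(
    data,
    function(col) is.factor(col) || is.character(col),
    logical(1)
  )
  candidates <- names(data)[is_discrete]
  # Continuous features named in `select` are dropped by the intersection
  if (!is.null(select)) candidates <- intersect(candidates, select)
  # Features with more than `maxcat` categories are dropped as well
  n_cat <- vapply(
    data[candidates],
    function(col) length(unique(col)),
    integer(1)
  )
  candidates[n_cat <= maxcat]
}

eligible_features(iris, maxcat = 5)
eligible_features(iris, maxcat = 5, select = c("Sepal.Length", "Species"))
eligible_features(iris, maxcat = 2)
```

With iris, the first two calls keep only `Species`, because `Sepal.Length` is continuous. The third call keeps nothing, because `Species` has three categories.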
Note
This is different from model.matrix, which aims to create a full-rank matrix for regression-like use cases. If your intention is to create a design matrix, use model.matrix instead.
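The difference can be seen with base R alone. The sketch below uses only `model.matrix` and `qr` on the iris data. The variable names `design` and `indicators` are chosen for this example.

```r
# Design matrix: an intercept plus k - 1 contrast columns (full rank)
design <- model.matrix(~ Species, data = iris)
colnames(design)

# One indicator column per level, obtained here by dropping the intercept
indicators <- model.matrix(~ Species - 1, data = iris)
colnames(indicators)

# Putting an intercept next to the full set of indicators makes the
# columns linearly dependent: 4 columns, but the rank is only 3
qr(cbind(1, indicators))$rank
```

The full set of indicator columns is what a dummified dataset contains, which is convenient for exploration but redundant once an intercept is present. That redundancy is why model.matrix drops one level by default.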
Examples
## Dummify iris dataset
str(dummify(iris))
#> 'data.frame': 150 obs. of 7 variables:
#> $ Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species_setosa : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Species_versicolor: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Species_virginica : int 0 0 0 0 0 0 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>
## Dummify diamonds dataset ignoring features with more than 5 categories
data("diamonds", package = "ggplot2")
str(dummify(diamonds, maxcat = 5))
#> 2 features with more than 5 categories ignored!
#> color: 7 categories
#> clarity: 8 categories
#> tibble [53,940 × 14] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> - attr(*, ".internal.selfref")=<externalptr>
## Dummify only the selected features of the diamonds dataset
str(dummify(diamonds, select = c("cut", "color")))
#> tibble [53,940 × 20] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> $ color_D : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_E : int [1:53940] 1 1 1 0 0 0 0 0 1 0 ...
#> $ color_F : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_G : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_H : int [1:53940] 0 0 0 0 0 0 0 1 0 1 ...
#> $ color_I : int [1:53940] 0 0 0 1 0 0 1 0 0 0 ...
#> $ color_J : int [1:53940] 0 0 0 0 1 1 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>