Data dummification is also known as one-hot encoding or feature binarization. It turns each category into a distinct column with binary (numeric) values.
dummify(data, maxcat = 50L, select = NULL)
Argument | Description |
---|---|
data | input data |
maxcat | maximum number of categories allowed for each discrete feature. Default is 50. |
select | names of selected features to be dummified. Default is NULL. |
The dummified dataset: only discrete features are encoded, and the original features are otherwise preserved. Column order may differ from the input.
Continuous features are ignored if included in select.
Features in select are ignored if their number of categories exceeds maxcat.
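The two rules above can be illustrated with a small base-R sketch. `eligible_features()` is a hypothetical helper written for this illustration, not part of the package, and it assumes that "discrete" means factor or character columns; the package's own detection of discrete features may differ.

```r
# Hypothetical helper (not part of the package): returns the names of the
# columns that would be eligible for dummification under the two rules above.
# Assumption: "discrete" means factor or character columns.
eligible_features <- function(data, maxcat = 50L, select = NULL) {
  # Identify discrete columns
  is_discrete <- vapply(
    data,
    function(col) is.factor(col) || is.character(col),
    logical(1)
  )
  candidates <- names(data)[is_discrete]

  # Rule 1: continuous features named in `select` are ignored
  if (!is.null(select)) candidates <- intersect(candidates, select)

  # Rule 2: features with more than `maxcat` categories are ignored
  n_cat <- vapply(
    data[candidates],
    function(col) length(unique(col)),
    integer(1)
  )
  candidates[n_cat <= maxcat]
}

eligible_features(iris)
eligible_features(iris, select = c("Sepal.Length", "Species"))
eligible_features(iris, maxcat = 2L)
```

In this sketch the first two calls keep only `Species`, because `Sepal.Length` is continuous, while the third call keeps nothing, because `Species` has three categories.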
This differs from model.matrix, which aims to create a full-rank matrix for regression-like use cases. If your intention is to create a design matrix, use model.matrix instead.
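To make the contrast concrete, the following base-R sketch shows how model.matrix encodes a factor. It uses only the built-in iris data and the default treatment contrasts; it does not call dummify itself.

```r
# Default treatment contrasts: an intercept plus k - 1 indicator columns.
# The first level (setosa) becomes the baseline and gets no column of its own,
# which keeps the matrix full rank.
design <- model.matrix(~ Species, data = iris)
colnames(design)
# "(Intercept)" "Speciesversicolor" "Speciesvirginica"

# Dropping the intercept yields one indicator column per level, which is
# closer in spirit to dummification for a single factor.
one_hot <- model.matrix(~ Species - 1, data = iris)
colnames(one_hot)
# "Speciessetosa" "Speciesversicolor" "Speciesvirginica"
```

With an intercept and one column per level, the indicator columns would sum to the intercept column, which is why the default design matrix drops a level.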
#> 'data.frame': 150 obs. of 7 variables:
#> $ Sepal.Length : num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#> $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#> $ Petal.Length : num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#> $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#> $ Species_setosa : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ Species_versicolor: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ Species_virginica : int 0 0 0 0 0 0 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>

## Dummify diamonds dataset ignoring features with more than 5 categories
data("diamonds", package = "ggplot2")
str(dummify(diamonds, maxcat = 5))
#> tibble [53,940 × 14] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> - attr(*, ".internal.selfref")=<externalptr>

#> tibble [53,940 × 20] (S3: tbl_df/tbl/data.frame)
#> $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
#> $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
#> $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
#> $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
#> $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
#> $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
#> $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
#> $ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
#> $ cut_Fair : int [1:53940] 0 0 0 0 0 0 0 0 1 0 ...
#> $ cut_Good : int [1:53940] 0 0 1 0 1 0 0 0 0 0 ...
#> $ cut_Ideal : int [1:53940] 1 0 0 0 0 0 0 0 0 0 ...
#> $ cut_Premium : int [1:53940] 0 1 0 1 0 0 0 0 0 0 ...
#> $ cut_Very.Good: int [1:53940] 0 0 0 0 0 1 1 1 0 1 ...
#> $ color_D : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_E : int [1:53940] 1 1 1 0 0 0 0 0 1 0 ...
#> $ color_F : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_G : int [1:53940] 0 0 0 0 0 0 0 0 0 0 ...
#> $ color_H : int [1:53940] 0 0 0 0 0 0 0 1 0 1 ...
#> $ color_I : int [1:53940] 0 0 0 1 0 0 1 0 0 0 ...
#> $ color_J : int [1:53940] 0 0 0 0 1 1 0 0 0 0 ...
#> - attr(*, ".internal.selfref")=<externalptr>