Title: | Collection of Machine Learning Data Sets for 'mlr3' |
---|---|
Description: | A small collection of interesting and educational machine learning data sets which are used as examples in the 'mlr3' book (<https://mlr3book.mlr-org.com>), the use case gallery (<https://mlr3gallery.mlr-org.com>), or in other examples. All data sets are properly preprocessed and ready to be analyzed by most machine learning algorithms. Data sets are automatically added to the dictionary of tasks if 'mlr3' is loaded. |
Authors: | Michel Lang [ctb] , Marc Becker [cre, aut] |
Maintainer: | Marc Becker <[email protected]> |
License: | LGPL-3 |
Version: | 0.9.0 |
Built: | 2024-11-08 09:16:32 UTC |
Source: | https://github.com/mlr-org/mlr3data |
Maintainer: Marc Becker [email protected] (ORCID)
Other contributors:
Michel Lang [email protected] (ORCID) [contributor]
Regression task to predict house sale prices for Ames, Iowa.
Contains 80 features and 2930 observations.
Target column is "Sale_Price".
data("ames_housing", package = "mlr3data")
str(ames_housing)
Regression data to predict the total count of bikes rented.
Contains 13 features and 17379 observations.
Target column is "count".
All columns have been renamed.
The columns "instant", "registered" and "casual" have been removed.
"season" and "weather" have been converted to factor().
"holiday" and "working_day" have been converted to logical().
https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
data("bike_sharing", package = "mlr3data")
str(bike_sharing)
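The bike sharing pre-processing steps listed above can be sketched in base R. This is a minimal illustration on a made-up four-row data frame, not the code used to build the packaged data set; the raw column names (`workingday`, `weathersit`, `cnt`) and all values are assumptions chosen for the example.

```r
# Toy stand-in for the raw data; column names and values are hypothetical.
raw <- data.frame(
  instant    = 1:4,
  season     = c(1, 2, 3, 4),
  holiday    = c(0, 1, 0, 0),
  workingday = c(1, 0, 1, 1),
  weathersit = c(1, 2, 1, 3),
  casual     = c(3, 8, 5, 3),
  registered = c(13, 32, 27, 10),
  cnt        = c(16, 40, 32, 13)
)

# Rename columns (this mapping is an assumption for illustration).
names(raw)[names(raw) == "workingday"] <- "working_day"
names(raw)[names(raw) == "weathersit"] <- "weather"
names(raw)[names(raw) == "cnt"] <- "count"

# Drop the row index and the two partial counts.
bikes <- raw[, !(names(raw) %in% c("instant", "registered", "casual"))]

# Categorical codes become factors, 0/1 flags become logicals.
bikes$season <- factor(bikes$season)
bikes$weather <- factor(bikes$weather)
bikes$holiday <- as.logical(bikes$holiday)
bikes$working_day <- as.logical(bikes$working_day)

str(bikes)
```

Storing the flags as logicals and the codes as factors keeps a learner from treating them as numeric quantities.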
Data for power consumption of kitchen appliances in Ames, Iowa.
Extends the ames_housing data set.
Contains 720 features and 2930 observations.
data("energy_usage", package = "mlr3data")
str(energy_usage)
Classification data to predict whether or not a person is a liver patient. Obtained using the mlr3oml package.
Contains 10 features and 538 observations.
Target column is "diseased".
All variables have been renamed.
The target variable has been re-encoded to "yes" and "no".
data("ilpd", package = "mlr3data")
str(ilpd)
Regression task to predict house sale prices for King County, including Seattle, between May 2014 and May 2015.
Contains 19 features and 21613 observations.
Target column is "price".
Id column has been removed.
Dates in column "date" have been converted from strings to POSIXct.
Values 0 in feature "yr_renovated" have been replaced with NA.
Values 0 in feature "sqft_basement" have been replaced with NA.
Feature "waterfront" has been converted to logical.
https://www.kaggle.com/datasets/harlfoxem/housesalesprediction
data("kc_housing", package = "mlr3data")
str(kc_housing)
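The King County pre-processing steps (date parsing, recoding 0 as missing, logical conversion) might look roughly like the following in base R. The three rows are invented, and the date string format and the UTC time zone are assumptions; the raw source may use a different format.

```r
# Toy stand-in for the raw data; all values are hypothetical.
raw <- data.frame(
  date          = c("2014-10-13", "2015-02-25", "2014-12-09"),
  yr_renovated  = c(0, 1991, 0),
  sqft_basement = c(0, 400, 910),
  waterfront    = c(0, 0, 1),
  stringsAsFactors = FALSE
)

kc <- raw

# Parse date strings into POSIXct (format and time zone are assumed here).
kc$date <- as.POSIXct(kc$date, format = "%Y-%m-%d", tz = "UTC")

# A 0 means "never renovated" / "no basement", so it is recoded as missing.
kc$yr_renovated[kc$yr_renovated == 0] <- NA
kc$sqft_basement[kc$sqft_basement == 0] <- NA

# 0/1 flag becomes a logical.
kc$waterfront <- as.logical(kc$waterfront)

str(kc)
```

Recoding 0 as NA avoids treating "year 0" as a real renovation year, at the cost of requiring learners or pipelines that can handle missing values.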
Regression data to predict the number of runs scored. Obtained using the mlr3oml package.
Contains 14 features and 1232 observations.
Target column is "rs".
All variable names have been converted from upper case to lower case.
The variables "year", "rs", "ra" and "w" have been coerced to integers.
https://www.openml.org/d/41021
data("moneyball", package = "mlr3data")
str(moneyball)
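As a rough sketch of the two moneyball pre-processing steps, lower-casing the names and coercing the count columns to integers can be done as follows. The two rows and the extra `OBP` column are made up for illustration.

```r
# Toy stand-in for the raw data; all values are hypothetical.
raw <- data.frame(
  Team = c("ARI", "ATL"),
  Year = c(2012, 2012),
  RS   = c(734, 700),
  RA   = c(688, 600),
  W    = c(81, 94),
  OBP  = c(0.328, 0.320),
  stringsAsFactors = FALSE
)

mb <- raw

# Upper-case variable names become lower case.
names(mb) <- tolower(names(mb))

# Whole-number columns stored as doubles are coerced to integers.
int_cols <- c("year", "rs", "ra", "w")
mb[int_cols] <- lapply(mb[int_cols], as.integer)

str(mb)
```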
Classification data to predict handwritten digits. Obtained using the mlr3oml package.
Binarized version of the original data set. The multi-class target column has been converted to a two-class nominal target column by re-labeling the majority class as positive ("P") and all others as negative ("N"). Originally converted by Quan Sun.
Contains 64 features and 5620 observations.
Target column is "binaryclass".
All feature variables "input1", ..., "input64" (number of on pixels in each block) have been coerced to integers.
The target variable has been renamed from "binaryClass" to "binaryclass".
data("optdigits", package = "mlr3data")
str(optdigits)
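The binarization described for optdigits (majority class becomes "P", everything else "N") can be illustrated on a small made-up label vector. This is only a sketch of the idea, not the original conversion code; the level order of the resulting factor is an assumption.

```r
# Hypothetical multi-class labels; "3" occurs most often.
digit <- factor(c("3", "1", "3", "7", "3", "1"))

# Find the most frequent class.
counts <- table(digit)
majority <- names(counts)[which.max(counts)]

# Majority class is relabeled "P" (positive), all other classes "N" (negative).
binaryclass <- factor(ifelse(digit == majority, "P", "N"), levels = c("P", "N"))

table(binaryclass)
```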
Classification data to predict the species of penguins from the palmerpenguins package. A better alternative to the iris data set.
The units of measurement have been removed from the column names. Lengths are given in millimeters (mm), weight in grams (g).
Observations with missing values have been removed.
Factor variables are one-hot encoded.
Gorman KB, Williams TD, Fraser WR (2014). “Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis).” PLoS ONE, 9(3), e90081. doi:10.1371/journal.pone.0090081.
https://github.com/allisonhorst/palmerpenguins
data("penguins_simple", package = "mlr3data")
str(penguins_simple)
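To make the three penguins pre-processing steps concrete, here is a hedged base R sketch on four invented rows. The helper `one_hot()` is hypothetical, and the `prefix.level` naming of the indicator columns is an assumption; check `str(penguins_simple)` for the names actually used.

```r
# Toy stand-in for the raw data; all values are hypothetical.
raw <- data.frame(
  species        = factor(c("Adelie", "Gentoo", "Chinstrap", "Adelie")),
  island         = factor(c("Torgersen", "Biscoe", "Dream", "Dream")),
  bill_length_mm = c(39.1, 46.1, NA, 37.8),
  sex            = factor(c("male", "female", "female", "female"))
)

# Strip unit suffixes from the column names.
names(raw) <- sub("_mm$|_g$", "", names(raw))

# Remove observations with missing values.
complete <- na.omit(raw)

# Hypothetical helper: one 0/1 indicator column per factor level.
one_hot <- function(x, prefix) {
  out <- lapply(levels(x), function(lvl) as.integer(x == lvl))
  names(out) <- paste(prefix, levels(x), sep = ".")
  as.data.frame(out)
}

# Encode the factor features; the target "species" stays a factor.
features <- cbind(
  complete[, c("species", "bill_length")],
  one_hot(complete$island, "island"),
  one_hot(complete$sex, "sex")
)

str(features)
```

With every feature numeric and no missing values, the result can be passed to learners that do not support factors or NAs, which is presumably why this simplified variant exists.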
Classification data to predict the fate of passengers on the ocean liner "Titanic".
Contains 10 features and 1309 observations. Target column is "survived".
All column names have been changed to snake_case.
Training and test set have been joined.
Observations of the test set have a missing value in the target column "survived".
Column "survived" has been re-encoded to a factor with levels "yes" and "no".
Id column has been removed.
Passenger class "pclass" has been converted to an ordered factor.
Features "sex" and "embarked" have been converted to factors.
Empty strings in "cabin" and "embarked" have been encoded as missing values.
The titanic package and https://www.kaggle.com/c/titanic/data
data("titanic", package = "mlr3data")
str(titanic)
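The titanic pre-processing steps can be sketched in base R along these lines. The four rows are invented, `tolower()` stands in for a full snake_case conversion because the toy names are single words, and the level order of "pclass" (1 < 2 < 3) is an assumption.

```r
# Toy stand-in for the joined training and test data; values are hypothetical.
# The third row mimics a test-set observation with an unknown outcome.
raw <- data.frame(
  PassengerId = 1:4,
  Survived    = c(0, 1, NA, 1),
  Pclass      = c(3, 1, 2, 1),
  Sex         = c("male", "female", "male", "female"),
  Cabin       = c("", "C85", "", "C123"),
  Embarked    = c("S", "C", "", "S"),
  stringsAsFactors = FALSE
)

# Drop the id column and lower-case the remaining names.
tt <- raw[, names(raw) != "PassengerId"]
names(tt) <- tolower(names(tt))

# Re-encode the 0/1 target as a factor with levels "yes" and "no".
tt$survived <- factor(tt$survived, levels = c(1, 0), labels = c("yes", "no"))

# Passenger class is ordinal (level order assumed to be 1 < 2 < 3).
tt$pclass <- factor(tt$pclass, levels = 1:3, ordered = TRUE)

# Empty strings are encoded as missing values before converting to factors.
tt$cabin[tt$cabin == ""] <- NA
tt$embarked[tt$embarked == ""] <- NA
tt$sex <- factor(tt$sex)
tt$embarked <- factor(tt$embarked)

str(tt)
```

Rows with a missing "survived" value would need to be filtered out before training a classifier, for example with `tt[!is.na(tt$survived), ]`.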