naniar 1.0.0
I’m very pleased to announce that naniar version 1.0.0 is now on CRAN!
The 1.0.0 version number signifies that this release accompanies the publication of the associated JSS paper, doi:10.18637/jss.v105.i07 (!!!). This paper is the product of a lot of effort by Di Cook and me, and I am very excited to be able to share it.
There is still a lot to do in naniar, and this release does not mean that no further changes are coming. The 1.0.0 version establishes that this is a stable release, and any upcoming changes will go through a more formal deprecation process.
Here’s a brief description of some of the changes in this release.
New things
JSS publication
You can now retrieve a citation for naniar with citation():
citation("naniar")
#>
#> To cite naniar in publications use:
#>
#> Tierney N, Cook D (2023). "Expanding Tidy Data Principles to
#> Facilitate Missing Data Exploration, Visualization and Assessment of
#> Imputations." _Journal of Statistical Software_, *105*(7), 1-31.
#> doi:10.18637/jss.v105.i07 <https://doi.org/10.18637/jss.v105.i07>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations},
#> author = {Nicholas Tierney and Dianne Cook},
#> journal = {Journal of Statistical Software},
#> year = {2023},
#> volume = {105},
#> number = {7},
#> pages = {1--31},
#> doi = {10.18637/jss.v105.i07},
#> }
Set missing values with set_n_miss() and set_prop_miss()
These functions allow you to set values to missing at random positions, specifying the amount either as a number of values or as a proportion:
library(naniar)
vec <- 1:10
# different each time
set_n_miss(vec, n = 1)
#> [1] NA 2 3 4 5 6 7 8 9 10
set_n_miss(vec, n = 1)
#> [1] 1 2 3 4 5 6 7 8 9 NA
set_prop_miss(vec, prop = 0.2)
#> [1] NA 2 3 NA 5 6 7 8 9 10
set_prop_miss(vec, prop = 0.6)
#> [1] 1 NA NA 4 NA NA NA 8 9 NA
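Because the missing positions change on every call, it can help to fix the random seed when you need a reproducible example. The following is a minimal sketch, assuming that set_n_miss() draws its positions with base R's random number generator (so that set.seed() controls it); the seed value 2023 and the object names first and second are arbitrary choices for illustration.

```r
library(naniar)

vec <- 1:10

# Assumption: set_n_miss() relies on base R's random number generator,
# so resetting the seed before each call should reproduce the same positions.
set.seed(2023)
first <- set_n_miss(vec, n = 2)

set.seed(2023)
second <- set_n_miss(vec, n = 2)

# Compare the two results and count the missing values
identical(first, second)
sum(is.na(first))
```

Under that assumption, both calls should place the NA values in the same positions.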
I would suggest using these functions inside a data frame. Here are a few examples using dplyr. For just one variable, you could set missingness like so:
library(tidyverse)
#> ── Attaching packages ───────────────────────────── tidyverse 1.3.2 ──
#> ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
#> ✔ tibble 3.1.8 ✔ dplyr 1.1.0
#> ✔ tidyr 1.3.0 ✔ stringr 1.5.0
#> ✔ readr 2.1.3 ✔ forcats 1.0.0
#> ── Conflicts ──────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
mtcars_df <- as_tibble(mtcars)
vis_miss(mtcars_df)
mtcars_miss_mpg <- mtcars_df %>%
  mutate(mpg = set_prop_miss(mpg, 0.5))
vis_miss(mtcars_miss_mpg)
Or add missingness to a few variables:
mtcars_miss_some <- mtcars_df %>%
  mutate(
    across(
      c(mpg, cyl, disp),
      \(x) set_prop_miss(x, 0.5)
    )
  )
mtcars_miss_some
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA NA 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 NA NA 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 NA 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 NA NA NA 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 NA 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 NA 245 3.21 3.57 15.8 0 0 3 4
#> 8 NA NA 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 NA 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 NA 123 3.92 3.44 18.3 1 0 4 4
#> # … with 22 more rows
vis_miss(mtcars_miss_some)
Or you can add missingness to all variables like so:
mtcars_miss_all <- mtcars_df %>%
  mutate(
    across(
      everything(),
      \(x) set_prop_miss(x, 0.5)
    )
  )
mtcars_miss_all
#> # A tibble: 32 × 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 NA NA 160 110 3.9 2.62 16.5 NA NA 4 NA
#> 2 21 NA NA 110 3.9 2.88 17.0 0 1 NA NA
#> 3 22.8 4 NA NA NA NA 18.6 1 NA 4 NA
#> 4 NA NA NA 110 NA NA 19.4 NA NA NA 1
#> 5 NA 8 NA NA NA 3.44 NA NA NA 3 2
#> 6 18.1 6 225 NA NA NA 20.2 1 0 3 1
#> 7 NA NA NA NA 3.21 3.57 NA 0 NA NA 4
#> 8 24.4 NA 147. NA 3.69 3.19 20 NA NA 4 2
#> 9 NA 4 141. 95 3.92 3.15 22.9 NA 0 NA NA
#> 10 NA NA 168. 123 3.92 NA NA NA 0 4 4
#> # … with 22 more rows
vis_miss(mtcars_miss_all)
miss_var_summary(mtcars_miss_all)
#> # A tibble: 11 × 3
#> variable n_miss pct_miss
#> <chr> <int> <dbl>
#> 1 mpg 16 50
#> 2 cyl 16 50
#> 3 disp 16 50
#> 4 hp 16 50
#> 5 drat 16 50
#> 6 wt 16 50
#> 7 qsec 16 50
#> 8 vs 16 50
#> 9 am 16 50
#> 10 gear 16 50
#> 11 carb 16 50
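The same pattern can be narrowed to columns of a particular type. As a small sketch (not taken from the release notes), the example below uses dplyr's where(is.numeric) selector to add missingness only to the numeric columns of the built-in iris data, leaving the Species factor untouched; the choice of iris, the 0.25 proportion, and the name iris_miss_numeric are assumptions made for illustration.

```r
library(naniar)
library(dplyr)

# iris has four numeric columns and one factor column (Species)
iris_miss_numeric <- as_tibble(iris) %>%
  mutate(
    across(
      where(is.numeric),
      \(x) set_prop_miss(x, 0.25)
    )
  )

# Summarise missingness per variable; Species should report no missing values
miss_var_summary(iris_miss_numeric)
```

Selecting by type like this avoids listing column names by hand when a data frame mixes numeric and categorical variables.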
This resolves #298.
Bug Fixes and other small changes
- Replaced tidyr::gather with tidyr::pivot_longer (#301).
- Fixed bug in gg_miss_var() where a warning appeared due to a change in how to remove the legend (#288).
- Removed package gdtools as it is no longer needed (#302).
- Imported the packages vctrs and cli to assist with internal checking and error messages. Both of these packages are “free” dependencies, as they are imported by existing dependencies, dplyr and ggplot2.
Some thank yous
Thank you to everyone who has contributed to this release! Especially the following people: @ddauber, @davidgohel.
I am also excited to announce that I have been supported by the R Consortium to improve how R handles missing values! Through this grant, I will be improving the R packages naniar and visdat. I will be posting more details about this soon, but what this means for you, the user, is that there will be more updates and improvements to both of these packages in the coming months. Stay tuned.