Format data frames and simple features using common approaches
Source:R/format_data.R
format_data.Rd
This function can apply the following common data cleaning tasks:
Applies stringr::str_squish and stringr::str_trim to all character columns
Optionally replaces all character values of "" with
NA
valuesOptionally corrects UNIX formatted dates with 1970-01-01 origins
Optionally renames variables by passing a named list of variables
The address functions previously included with format_data()
are now
documented at format_address_data()
.
Usage
format_data(
x,
var_names = NULL,
xwalk = NULL,
clean_names = TRUE,
.name_repair = "check_unique",
replace_na_with = NULL,
replace_with_na = NULL,
replace_empty_char_with_na = FALSE,
fix_date = FALSE,
label = FALSE,
remove_empty = NULL,
remove_constant = FALSE,
format_sf = FALSE,
...,
call = caller_env()
)
rename_with_xwalk(
x,
xwalk = NULL,
cols = c("label", "name"),
label = FALSE,
.strict = TRUE,
keep_all = TRUE,
arg = caller_arg(x),
call = caller_env()
)
label_with_xwalk(x, xwalk = NULL, label = "var", ...)
make_variable_dictionary(
x,
.labels = NULL,
.definitions = NULL,
details = c("basic", "none", "full")
)
fix_epoch_date(x, .cols = dplyr::contains("date"), tz = "")
Arguments
- x
A tibble or data frame object
- var_names
A named list following the format,
list("New var name" = old_var_name)
, or a two column data frame with the first column being the new variable names and the second column being the old variable names; defaults toNULL
.- xwalk
A data frame with two columns using the first column as name and the second column as value; or a named list. The existing names of x must be the values and the new names must be the names.
- clean_names
If
TRUE
, set .name_repair tojanitor::make_clean_names()
; defaults toTRUE
.- .name_repair
Defaults to "check_unique"
- replace_na_with
A named list to pass to
tidyr::replace_na()
; defaults toNULL
.- replace_with_na
A named list to pass to
naniar::replace_with_na()
; defaults toNULL
.- replace_empty_char_with_na
If
TRUE
, replace "" withNA
usingnaniar::replace_with_na_if()
, Default:TRUE
- fix_date
If
FALSE
, fix UNIX epoch dates (common issue with dates from FeatureServer and MapServer sources) using thefix_epoch_date()
function, Default:TRUE
- label
For
label_with_xwalk()
uselabel = "val"
to uselabelled::set_value_labels()
or "var" (default) to uselabelled::set_variable_labels()
. Forrename_with_xwalk()
, if label isTRUE
, xwalk is passed tolabel_with_xwalk()
with label = "var" to label columns using the original names. Defaults toFALSE
.- remove_empty
If not
NULL
, pass values ("rows", "cols" or c("rows", "cols") (default)) to the which parameter ofjanitor::remove_empty()
- remove_constant
If
TRUE
, pass data to janitor::remove_constant() using default parameters.- format_sf
If
TRUE
, pass x and additional parameters toformat_sf_data()
.- ...
Additional parameters passed to
format_sf_data()
- cols
Column names to use for crosswalk.
- .strict
If
TRUE
(default), require that all values from the xwalk are found in the column names of the x data.frame. IfFALSE
, unmatched values from the xwalk are ignored.- keep_all
If
FALSE
, columns that are not named in the xwalk are dropped. IfTRUE
(default), all columns are retained. If x is ansf
object, the geometry column will not be dropped even it is not renamed.- arg, call
Additional parameters used internally with
cli::cli_abort()
to improve error messages.- .labels
Replaces labels column created by
labelled::generate_dictionary()
if column is allNA
(no existing labels assigned); defaults toNULL
.- .definitions
Character vector of definitions appended to dictionary data frame. Must be in the same order as the variables in the provided data frame x.
- details
add details about each variable (full details could be time consuming for big data frames,
FALSE
is equivalent to"none"
andTRUE
to"full"
)- .cols
tidyselect for columns to apply epoch date fixing function to. Defaults to
dplyr::contains("date")
.- tz
Time zone passed to
as.POSIXct()
.
Examples
nc <- get_location_data(data = system.file("shape/nc.shp", package = "sf"))
format_data(nc)
#> Simple feature collection with 100 features and 14 fields
#> Geometry type: MULTIPOLYGON
#> Dimension: XY
#> Bounding box: xmin: -9386880 ymin: 4012991 xmax: -8399788 ymax: 4382079
#> Projected CRS: WGS 84 / Pseudo-Mercator
#> # A tibble: 100 × 15
#> area perimeter cnty cnty_id name fips fipsno cress_id bir74 sid74 nwbir74
#> * <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <dbl> <dbl>
#> 1 0.114 1.44 1825 1825 Ashe 37009 37009 5 1091 1 10
#> 2 0.061 1.23 1827 1827 Alle… 37005 37005 3 487 0 10
#> 3 0.143 1.63 1828 1828 Surry 37171 37171 86 3188 5 208
#> 4 0.07 2.97 1831 1831 Curr… 37053 37053 27 508 1 123
#> 5 0.153 2.21 1832 1832 Nort… 37131 37131 66 1421 9 1066
#> 6 0.097 1.67 1833 1833 Hert… 37091 37091 46 1452 7 954
#> 7 0.062 1.55 1834 1834 Camd… 37029 37029 15 286 0 115
#> 8 0.091 1.28 1835 1835 Gates 37073 37073 37 420 0 254
#> 9 0.118 1.42 1836 1836 Warr… 37185 37185 93 968 4 748
#> 10 0.124 1.43 1837 1837 Stok… 37169 37169 85 1612 1 160
#> # ℹ 90 more rows
#> # ℹ 4 more variables: bir79 <dbl>, sid79 <dbl>, nwbir79 <dbl>,
#> # geometry <MULTIPOLYGON [m]>