After compose_cells, this function rearranges and renames attribute columns so that they are properly aligned, based on the content of the columns.

collate_columns(
  composed_data,
  combine_threshold = 1,
  rest_cols = Inf,
  retain_other_cols = FALSE,
  retain_cell_address = FALSE
)

Arguments

composed_data

output of compose_cells (preferably not processed)

combine_threshold

a numeric threshold (between 0 and 1) for content-based collation of columns. (Default 1)

rest_cols

number of remaining columns to keep: beyond the joins made under combine_threshold, this many further columns are kept. (Default Inf)

retain_other_cols

whether to keep other intermediate (and possibly not so important) columns. (Default FALSE)

retain_cell_address

whether to keep columns like (row, col, data_block). This may be required to trace values back to their source cells. (Default FALSE)

Value

A column-collated data.frame

Details

  • Dependency on stringdist: If you have stringdist installed, the approximate string matching will be enhanced. Results may therefore differ depending on whether stringdist is installed.

  • Possibility of randomness: If an attribute column contains many distinct values, a representative sample of the column will be drawn. Hence it is recommended to call set.seed beforehand if reproducibility is a concern.
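The reproducibility point can be illustrated with a minimal base R sketch. It does not use the sampling code inside collate_columns (which is not shown on this page); `sample()` and the `attr_values` vector are stand-ins assumed purely for illustration.

```r
# Hypothetical stand-in for an attribute column with many distinct values
attr_values <- paste0("name_", seq_len(1000))

# Same seed before each draw: the two samples are identical
set.seed(42)
s1 <- sample(attr_values, 10)
set.seed(42)
s2 <- sample(attr_values, 10)

# No reseeding here: this draw continues the random stream
s3 <- sample(attr_values, 10)

identical(s1, s2)  # TRUE
```

In practice this suggests calling set.seed with a fixed value immediately before collate_columns whenever the result has to be repeatable.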

Examples

d <- system.file("extdata", "marks_cells.rds", package = "tidycells", mustWork = TRUE) %>%
  readRDS()
d <- numeric_values_classifier(d)
da <- analyze_cells(d)
dc <- compose_cells(da, print_attribute_overview = TRUE)
#> data_block = 1
#> minor_col_top_1_1
#> School A
#> major_col_top_1_1
#> Score
#> minor_corner_topLeft_1_1
#> Student Name
#> major_row_left_2_1
#> Nakshatra Gayen, Titas Gupta, Ujjaini Gayen, Utsyo Roy
#> major_row_left_1_1
#> Female, Male
#> data_block = 2
#> minor_corner_topLeft_1_1
#> School B
#> major_col_bottom_1_1
#> Indranil Gayen, S Gayen, Sarmistha Senapati, Shtuti Roy
#> major_col_bottom_2_1
#> Female, Male
#> minor_corner_bottomLeft_1_1
#> Student
#> major_row_left_1_1
#> Score
#> data_block = 3
#> major_col_top_1_1
#> Score
#> minor_corner_topLeft_1_1
#> Name
#> major_row_left_2_1
#> I Roy, S Ghosh, S Senapati, U Gupta
#> major_row_left_1_1
#> School C
#> minor_row_right_1_1
#> Female, Male
collate_columns(dc)
#> # A tibble: 12 x 6
#>    collated_1 collated_2 collated_3 collated_4   collated_5         value
#>    <chr>      <chr>      <chr>      <chr>        <chr>              <chr>
#>  1 Score      Male       School A   Student Name Utsyo Roy          95
#>  2 Score      Male       School A   Student Name Nakshatra Gayen    99
#>  3 Score      Female     School A   Student Name Titas Gupta        89
#>  4 Score      Female     School A   Student Name Ujjaini Gayen      100
#>  5 Score      Male       School B   Student      Indranil Gayen     70
#>  6 Score      Male       School B   Student      S Gayen            75
#>  7 Score      Female     School B   Student      Sarmistha Senapati 81
#>  8 Score      Female     School B   Student      Shtuti Roy         90
#>  9 Score      Male       School C   Name         I Roy              50
#> 10 Score      Male       School C   Name         S Ghosh            59
#> 11 Score      Female     School C   Name         S Senapati         61
#> 12 Score      Female     School C   Name         U Gupta            38
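The non-default arguments can be tried on the same sample data. The following is a sketch only: it uses just the arguments listed under Arguments, the threshold value 0.8 is an arbitrary choice, and no output is shown because it has not been verified here. The requireNamespace guard is added so the snippet is skipped when tidycells is not installed.

```r
if (requireNamespace("tidycells", quietly = TRUE)) {
  library(tidycells)

  # Rebuild the composed data from the bundled sample file
  d <- readRDS(system.file("extdata", "marks_cells.rds",
                           package = "tidycells", mustWork = TRUE))
  d <- numeric_values_classifier(d)
  da <- analyze_cells(d)
  dc <- compose_cells(da)

  # Fix the seed, since collation may sample from large attribute columns
  set.seed(1)

  # Keep cell-address columns so each value can be traced back
  traced <- collate_columns(dc, retain_cell_address = TRUE)

  # Use a lower threshold for content-based collation (0.8 is arbitrary)
  loose <- collate_columns(dc, combine_threshold = 0.8)

  # Keep the intermediate columns that are dropped by default
  full <- collate_columns(dc, retain_other_cols = TRUE)
}
```

Comparing names(traced), names(loose) and names(full) against the default result is a quick way to see what each argument changes.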