This function reads cell-level information from many file types (xls, pdf, doc, etc.) and then analyzes, composes, and collates the columns. It is a wrapper around functions from multiple packages, so support for a specific file type depends on which packages are installed. To see the list of supported file types and any packages they may require, run read_cells() in the console. File types are detected from the file content, not just from the file extension: if a file is saved as pdf and its extension is then removed (or changed to, say, .xlsx), read_cells will still detect it as pdf and read its content.

Note :

  • read_cells is intended to work for any kind of data. However, if it fails at an intermediate stage, it raises a warning and returns the results up to the last successfully processed stage.

  • The heuristic algorithms are not yet well optimized, so they may be slow on large files.

  • If the target table has numerical values as data and text as their attributes (identifiers of the data elements), the straightforward method is sufficient in the majority of situations. Otherwise, you may need to use other functions.

A Word of Warning :

The functions used inside read_cells are based on heuristic algorithms, so outcomes may be unexpected. It is recommended to try read_cells on the target file first. If the outcome is as expected, nothing more is needed. If not, try again with read_cells(file_name, at_level = "compose"). If the output is still not as expected, other functions are required: start again with read_cells(file_name, at_level = "make_cells") and proceed with further functions.
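The fallback sequence described above can be sketched as follows. This is a minimal illustration, not a prescribed workflow: it assumes tidycells is installed and reuses the sample csv file from the Examples section; the variable names are chosen here for illustration only.

```r
library(tidycells)

# Sample file shipped with tidycells (same file as in the Examples section)
fold <- system.file("extdata", "messy", package = "tidycells", mustWork = TRUE)
file_name <- list.files(fold, pattern = "^csv.", full.names = TRUE)[1]

# Step 1: default call, processes up to the "collate" stage
out <- read_cells(file_name)

# Step 2: if the collated result is not as expected, stop at "compose"
out_compose <- read_cells(file_name, at_level = "compose")

# Step 3: if that is still not as expected, stop at "make_cells" and
# continue manually with other functions from there
cells <- read_cells(file_name, at_level = "make_cells")
```

Inspect the result after each step and move on to the next one only when the output is not what you expected.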

read_cells(
  x,
  at_level = c("collate", "detect_and_read", "make_cells", "va_classify", "analyze",
    "compose"),
  omit = NULL,
  simplify = TRUE,
  compose_main_cols_only = TRUE,
  from_level,
  silent = TRUE,
  ...
)

Arguments

x

either a valid file path or a read_cell_part

at_level

the level up to which to process. Should be one of detect_and_read, make_cells, va_classify, analyze, compose, collate, or simply a number (for example, 1 means detect_and_read and 5 means compose).

omit

(Optional) the file-types to omit. A character vector.

simplify

whether to simplify the output. (Default TRUE). If set to FALSE a read_cell_part will be returned.

compose_main_cols_only

whether to compose main columns only. (Default TRUE).

from_level

(Optional) override start level. (read_cells will process after from_level)

silent

if TRUE, no messages will be displayed. (Default TRUE)

...

further arguments
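As a hedged illustration of the arguments above, the calls below pass at_level by name and by number, and turn messages on with silent = FALSE. The sketch assumes tidycells is installed and uses the sample csv file from the Examples section; the variable names are illustrative only.

```r
library(tidycells)

# Sample file shipped with tidycells (same file as in the Examples section)
fold <- system.file("extdata", "messy", package = "tidycells", mustWork = TRUE)
f <- list.files(fold, pattern = "^csv.", full.names = TRUE)[1]

# at_level given by name
by_name <- read_cells(f, at_level = "compose")

# at_level given by number: per the argument description, 5 means compose
by_number <- read_cells(f, at_level = 5)

# silent = FALSE allows messages to be displayed during processing
with_messages <- read_cells(f, at_level = "compose", silent = FALSE)
```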

Value

If simplify=TRUE, a different kind of object is returned at each level (depending on at_level). If at_level="compose", only the final tibble is returned; otherwise, if the output is not NULL, it carries an attribute named "read_cells_stage".

If simplify=FALSE, a read_cell_part is returned, which you can process manually and then pass to read_cells again to continue (from_level may be useful at that point).
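A minimal sketch of the two return modes, assuming tidycells is installed and using the sample csv file from the Examples section. Passing the read_cell_part back as x relies on the description of the x argument; from_level is left unset here because its accepted values are not spelled out on this page.

```r
library(tidycells)

# Sample file shipped with tidycells (same file as in the Examples section)
fold <- system.file("extdata", "messy", package = "tidycells", mustWork = TRUE)
f <- list.files(fold, pattern = "^csv.", full.names = TRUE)[1]

# simplify = FALSE: stop early and keep the read_cell_part
part <- read_cells(f, at_level = "make_cells", simplify = FALSE)

# A read_cell_part is a valid value for x, so processing can be continued
final <- read_cells(part)

# simplify = TRUE: an intermediate result may carry the stage attribute
cells <- read_cells(f, at_level = "make_cells")
stage <- attr(cells, "read_cells_stage")
```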

Details

When called with the default at_level, it performs the following stages in order: detect_and_read, make_cells, va_classify, analyze, compose, collate.

[Figure: flowchart of the stages above]

Examples

# see supported files
read_cells()
#> Please provide a valid file path to process.
#> Support present for following type of files: csv, xls, xlsx, doc, docx, pdf, html
#> Note:
#>   = LibreOffice is present so doc files will be supported but it may take little longer time to read/detect.
#>     You may need to open LibreOffice outside this R-Session manually to speed it up.
#>     In case the doc is not working, try running docxtractr::read_docx('<target doc file>').
#>     Check whether the file is being read correctly.
#>   = Support is enabled for content type (means it will work even if the extension is wrong)
#>
#> Details:
#> +----------------------------------------------+
#> |                                              |
#> |     type        package    present support   |
#> |   1 csv{utils}  utils      v       v         |
#> |   2 csv         readr      v       v         |
#> |   3 xls{readxl} readxl     v       v         |
#> |   4 xls         xlsx       v       v         |
#> |   5 xlsx        tidyxl     v       v         |
#> |   6 doc         docxtractr v       v         |
#> |   7 docx        docxtractr v       v         |
#> |   8 pdf         tabulizer  v       v         |
#> |   9 html        XML        v       v         |
#> |                                              |
#> +----------------------------------------------+
fold <- system.file("extdata", "messy", package = "tidycells", mustWork = TRUE)
# File extension is intentionally given wrong
# while filename is the actual identifier of the file type
fcsv <- list.files(fold, pattern = "^csv.", full.names = TRUE)[1]
# read the data
read_cells(fcsv)
#> # A tibble: 4 x 5
#>   collated_1 collated_2 collated_3 table_tag value
#>   <chr>      <chr>      <chr>      <chr>     <chr>
#> 1 Weight     Nakshatra  Kid Name   Table_1   12
#> 2 Weight     Titas      Kid Name   Table_1   16
#> 3 Age        Nakshatra  Kid Name   Table_1   1.5
#> 4 Age        Titas      Kid Name   Table_1   6
read_cells(fcsv, simplify = FALSE)
#> A partial read_cell
#> At stage collate