Read file by extension into R.

import(
  file,
  format = "auto",
  rownames = TRUE,
  colnames = TRUE,
  sheet = 1L,
  skip = 0L,
  makeNames = getOption("acid.import.make.names", default = syntactic::makeNames),
  metadata = getOption("acid.import.metadata", default = FALSE),
  quiet = getOption("acid.quiet", default = FALSE)
)

Arguments

file

character(1). File path.

format

character(1). An optional file format type, which can be used to override the file format inferred from file. Only recommended for file and URL paths that don't contain an extension.

rownames

logical(1). Automatically assign row names, if rowname column is defined. Applies to file types that return data.frame only.

colnames

logical(1) or character. Automatically assign column names, using the first header row. Applies to file types that return data.frame only. Pass in a character vector to define the column names manually.

sheet

character(1) or integer(1). Applies to Excel Workbook, Google Sheet, or GraphPad Prism file. Sheet to read. Either a string (the name of a sheet), or an integer (the position of the sheet). Defaults to the first sheet.

skip

integer(1). Applies to delimited file (CSV, TSV), Excel Workbook, or lines. Number of lines to skip.

makeNames

function. Apply syntactic naming function to (column) names.

metadata

logical(1). Whether to include metadata. Defaults to FALSE unless the acid.import.metadata option is set.

quiet

logical(1). Perform command quietly, suppressing messages.

Value

Varies, depending on the file type (format):

  • Plain text delimited (CSV, TSV, TXT): data.frame.
    Data separated by commas, tabs, or visual spaces.
    Note that TXT structure is ambiguous and actively discouraged.
    Refer to Data frame return section for details on how to change the default return type to DataFrame, tbl_df or data.table.
    Imported by vroom::vroom().

  • Excel workbook (XLSB, XLSX): data.frame.
    Resave in plain text delimited format instead, if possible.
    Imported by readxl::read_excel().

  • Legacy Excel workbook (pre-2007) (XLS): data.frame.
    Resave in plain text delimited format instead, if possible.
    Note that import of files in this format is slow.
    Imported by readxl::read_excel().

  • GraphPad Prism project (PZFX): data.frame.
    Experimental. Consider resaving in CSV format instead.
    Imported by pzfx::read_pzfx().

  • General feature format (GFF, GFF1, GFF2, GFF3, GTF): GRanges.
    Imported by rtracklayer::import().

  • MatrixMarket exchange sparse matrix (MTX): sparseMatrix.
    Imported by Matrix::readMM().

  • Gene sets (for GSEA) (GMT, GMX): character.

  • Browser extensible data (BED, BED15, BEDGRAPH, BEDPE): GRanges.
    Imported by rtracklayer::import().

  • ChIP-seq peaks (BROADPEAK, NARROWPEAK): GRanges.
    Imported by rtracklayer::import().

  • Wiggle track format (BIGWIG, BW, WIG): GRanges.
    Imported by rtracklayer::import().

  • JSON serialization data (JSON): list.
    Imported by jsonlite::read_json().

  • YAML serialization data (YAML, YML): list.
    Imported by yaml::yaml.load_file().

  • Lines (LOG, MD, PY, R, RMD, SH): character. Source code or log files.
    Imported by read_lines().

  • R data serialized (RDS): variable.
    Currently recommended over RDA, if possible.
    Imported by readRDS().

  • R data (RDA, RDATA): variable.
    Must contain a single object. Doesn't require internal object name to match, unlike loadData().
    Imported by load().

  • Infrequently used rio-compatible formats (ARFF, DBF, DIF, DTA, MAT, MTP, ODS, POR, SAS7BDAT, SAV, SYD, REC, XPT): variable.
    Imported by rio::import().

Details

import() supports automatic loading of common file types by wrapping popular importer functions. It is intentionally designed to be simple, with few arguments. Remote URLs and compressed files are supported. If you need more complex import settings, call the wrapped importer directly instead.
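The idea of dispatching on the file extension can be sketched in a few lines of base R. This is only an illustration, not the package's implementation: readByExtension is a hypothetical helper, and it wraps base R readers rather than the importers listed above.

```r
# Hypothetical helper (not part of pipette): choose a base R reader from the
# file extension, mirroring the idea of extension-based dispatch.
readByExtension <- function(file) {
    ext <- toupper(tools::file_ext(file))
    switch(
        EXPR = ext,
        "CSV" = utils::read.csv(file, stringsAsFactors = FALSE),
        "TSV" = utils::read.delim(file, stringsAsFactors = FALSE),
        "RDS" = readRDS(file),
        # Empty alternatives fall through to the next non-empty one.
        "LOG" = ,
        "MD" = ,
        "R" = readLines(file),
        stop(sprintf("Unsupported extension: '%s'.", ext))
    )
}

# Usage: write a small CSV to a temporary file and read it back.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:2, b = c("x", "y")), file = tmp, row.names = FALSE)
x <- readByExtension(tmp)
```

A real wrapper also has to deal with remote URLs and compressed files, which this sketch leaves out.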

Note

Updated 2020-08-13.

Row and column names

Row names. Row name handling has become an inconsistent mess in R because of differing support in base R, tidyverse, data.table, and Bioconductor. To maintain sanity, import() attempts to handle row names automatically. The function checks for a rowname column in delimited data and moves these values into the object's row names, if supported by the return type (e.g. data.frame, DataFrame). Note that tbl_df (tibble) and data.table intentionally do not support row names; when returning either of these types, no attempt is made to move the rowname column into the row names. import() is strict about this matching and only checks for a column named rowname, similar to the default syntax recommended in tibble::rownames_to_column(). To disable this behavior, set rownames = FALSE, and no attempt will be made to set the row names.
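The row name handling described above can be approximated in base R as follows. This is a minimal sketch, not the package's internal code; moveRownameColumn is a hypothetical helper name.

```r
# Hypothetical helper (not pipette's internal code): move a column named
# "rowname" into the row names of a data.frame, if such a column exists.
moveRownameColumn <- function(df, col = "rowname") {
    if (!col %in% colnames(df)) {
        # Strict matching: nothing to do without an exact column name match.
        return(df)
    }
    rownames(df) <- df[[col]]
    df[[col]] <- NULL
    df
}

df <- data.frame(
    rowname = c("gene1", "gene2"),
    sample1 = c(16L, 29L),
    sample2 = c(20L, 22L),
    stringsAsFactors = FALSE
)
out <- moveRownameColumn(df)
```

After this call, out has row names gene1 and gene2, and only the sample columns remain.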

Column names. import() assumes that delimited files always contain column names. If you are working with a file that doesn't contain column names, either set colnames = FALSE or pass the names in as a character vector. It's strongly recommended to always define column names in a supported file type.
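As a rough illustration of the two options for a headerless file, the following sketch writes a small CSV without a header row and then imports it both ways. The file contents and the manualNames vector are made up for the example, and the import() calls only run when pipette is installed.

```r
# Write a headerless CSV to a temporary file (illustrative data only).
tmp <- tempfile(fileext = ".csv")
writeLines(c("gene1,16,20", "gene2,29,22"), con = tmp)

# Hypothetical column names, defined manually.
manualNames <- c("geneId", "sample1", "sample2")

if (requireNamespace("pipette", quietly = TRUE)) {
    # Option 1: pass the column names in as a character vector.
    x <- pipette::import(tmp, colnames = manualNames)
    # Option 2: disable column names, so the first line is treated as data.
    y <- pipette::import(tmp, colnames = FALSE)
}
```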

Data frame return

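By default, import() returns a standard data.frame for delimited/column formatted data. The return type can be changed globally with the acid.data.frame option, e.g. options(acid.data.frame = "data.frame").

Supported return types:

  • data.frame: Base R default.
  • DataFrame: S4Vectors (Bioconductor).
  • tbl_df: tibble (tidyverse).
  • data.table: data.table.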

Note that stringsAsFactors is always disabled for import.
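A global option like this is usually set temporarily and then restored. The sketch below assumes that "tbl_df" is an accepted value for acid.data.frame, matching the type name used elsewhere on this page; only the "data.frame" value is shown verbatim above.

```r
# Set the option and keep the previous value so it can be restored.
# Assumption: "tbl_df" is an accepted value for this option.
old <- options(acid.data.frame = "tbl_df")
during <- getOption("acid.data.frame")

# Calls to import() made here would see the modified option.

# Restore the previous setting (this unsets the option if it was not set).
options(old)
```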

Matrix Market Exchange (MTX)

Reading a Matrix Market Exchange file requires ROWNAMES and COLNAMES sidecar files containing the corresponding row and column names of the sparse matrix.
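The sidecar layout can be illustrated with the Matrix package directly. In this sketch the sidecar files are named by appending .rownames and .colnames to the MTX path; that naming is an assumption made for the example, not something stated above.

```r
library(Matrix)

# A small sparse matrix with 3 rows (genes) and 2 columns (samples).
counts <- sparseMatrix(
    i = c(1L, 2L, 3L),
    j = c(1L, 2L, 2L),
    x = c(5, 3, 1),
    dims = c(3L, 2L)
)
rn <- c("gene1", "gene2", "gene3")
cn <- c("sample1", "sample2")

# Write the matrix plus one-name-per-line sidecar files.
# Assumption: sidecars sit next to the MTX file with these suffixes.
mtxFile <- tempfile(fileext = ".mtx")
writeMM(counts, file = mtxFile)
writeLines(rn, con = paste0(mtxFile, ".rownames"))
writeLines(cn, con = paste0(mtxFile, ".colnames"))

# Read the matrix back and reattach the names from the sidecar files.
mat <- readMM(mtxFile)
dimnames(mat) <- list(
    readLines(paste0(mtxFile, ".rownames")),
    readLines(paste0(mtxFile, ".colnames"))
)
```

The MTX file itself stores only indices and values, which is why the names have to travel separately.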

General feature format (GFF, GTF)

The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. The GTF (General Transfer Format) is identical to GFF version 2.
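To make the 9-column layout concrete, here is a base R sketch that parses a single made-up GFF3 record. The column names follow the GFF3 convention, the feature line is invented for the example, and this is not how rtracklayer::import() reads the file.

```r
# One header/track line plus one invented feature line (tab-separated).
gffLines <- c(
    "##gff-version 3",
    "chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=GENE1"
)

# Lines starting with "#" are skipped via comment.char.
gff <- utils::read.delim(
    text = gffLines,
    header = FALSE,
    comment.char = "#",
    col.names = c(
        "seqid", "source", "type", "start", "end",
        "score", "strand", "phase", "attributes"
    ),
    stringsAsFactors = FALSE
)
```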

basejump exports the specialized makeGRangesFromGFF() function that makes GFF loading simple.

Gene sets (GMT, GMX)

Refer to the Broad Institute GSEA wiki for details.

bcbio count matrix

bcbio count matrix and related sidecar files are natively supported.

  • COUNTS: Counts table (e.g. RNA-seq aligned counts).

  • COLNAMES: Sidecar file containing column names.

  • ROWNAMES: Sidecar file containing row names.

Blacklisted extensions

These file formats are blacklisted, and intentionally not supported: DOC, DOCX, PDF, PPT, PPTX.
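A check of this kind can be sketched with tools::file_ext(). The isBlacklisted helper below is hypothetical and only illustrates the idea; it is not the package's own check.

```r
# Extensions listed above as intentionally unsupported.
blacklist <- c("DOC", "DOCX", "PDF", "PPT", "PPTX")

# Hypothetical helper: TRUE when a file's extension is on the blacklist.
isBlacklisted <- function(file) {
    toupper(tools::file_ext(file)) %in% blacklist
}

res <- isBlacklisted(c("report.pdf", "counts.csv"))
```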

Examples

file <- system.file("extdata/example.csv", package = "pipette")

## Row and column names enabled.
x <- import(file)
#> → Importing example.csv at /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/RtmpNgLlSO/temp_libpath28297c43414/pipette/extdata using vroom::`vroom()`.
#> Setting row names from `rowname` column.
#>       sample1 sample2 sample3 sample4
#> gene1      16      20      13      16
#> gene2      29      22      43      50
#> gene3     243     245     186     184
#> gene4       7      14      25      16
#> gene5       1       1       2       2
#> gene6      10      17      18      11

## Row and column names disabled.
x <- import(file, rownames = FALSE, colnames = FALSE)
#> → Importing example.csv at /private/var/folders/l1/8y8sjzmn15v49jgrqglghcfr0000gn/T/RtmpNgLlSO/temp_libpath28297c43414/pipette/extdata using vroom::`vroom()`.
#>        X1      X2      X3      X4      X5
#> 1 rowname sample1 sample2 sample3 sample4
#> 2   gene1      16      20      13      16
#> 3   gene2      29      22      43      50
#> 4   gene3     243     245     186     184
#> 5   gene4       7      14      25      16
#> 6   gene5       1       1       2       2