Tutorial 5: Wrangle real data into a SummarizedExperiment

Until now, experiments were either simulated, imported from R or converted from a CommunityProfile. Producing instances of SummarizedExperiment from scratch might appear more challenging, as it requires some data analysis skills to wrangle the raw data into the necessary shape. Also, filtering missing values or setting some thresholds could also come into handy. Here, a few possible approaches are explained.

Assays

It is recommended to store experimental results and bioinformatics output into an Excel workbook or as a group of csv files, which can be easily imported.

using XLSX, CSV, DataFrames

# read assays stored in same excel and store them into separate variables
first_assay = XLSX.readdata(joinpath(@__DIR__, "assets/RE_assays.xlsx"), "Sheet1", "B2:E11") |> Matrix
second_assay = XLSX.readdata(joinpath(@__DIR__, "assets/RE_assays.xlsx"), "Sheet1", "F2:I11") |> Matrix

# replace NA values to Julia-native "missing" object class
replace!(first_assay, "NA" => missing)
replace!(second_assay, "NA" => missing)

# assemble assays into a single object
assays = OrderedDict{String, AbstractArray}("first_assay" => first_assay,
                                            "second_assay" => second_assay)
OrderedDict{String, AbstractArray} with 2 entries:
  "first_assay"  => Any[10482.2 1441.07 6448.97 7737.69; 9401.0 7211.86 4721.28…
  "second_assay" => Any[10787.3 12275.2 16580.6 2682.84; missing missing missin…

Rowdata and Coldata

After loading the csv file containing the rowdata, a few operations are needed to make it suitable for feeding into a SummarizedExperiment. For more analytical tips, the docs on DataFrames.jl and this toolkit are recommended.

# load rowdata from csv file
row_data = CSV.File(joinpath(@__DIR__, "assets/RE_rowdata.csv")) |> DataFrame

# rename "Column1" to "name"
rename!(row_data, "Column1" => :name)

# sort by feature number so rowdata order matches assay order
row_data[!, :order] = parse.(Int64, replace.(row_data[!, :name], "bin" => ""))
sort!(row_data, :order)
10×5 DataFrame
Rownamegenusspeciesnativeorder
String7String31StringString7Int64
1bin1g__MGIIa-L2s__MGIIa-L2 sp013911465true1
2bin2g__UBA3478s__UBA3478 sp011525015false2
3bin3g__GCA-2684655s__true3
4bin4g__s__nothing4
5bin5g__Shewanellas__true5
6bin6g__Sphingomonass__Sphingomonas ginsenosidimutanstrue6
7bin7g__Flavobacteriums__Flavobacterium sp002862805true7
8bin8g__Vibrios__Vibrio azureusfalse8
9bin9g__Polaribacters__nothing9
10bin10g__Alteromonass__Alteromonas sp009811415false10

The same method can be applied to the coldata. If it is not available or is not relevant for the analysis, it can be substituted with an empty DataFrame containing only the sample names as done below. Then, we gather the components into a SummarizedExperiment.

# create empty coldata object
col_data = DataFrame(
    name = ["sample$i" for i in 1:4]
);

# assemble into SummarizedExperiment
se = SummarizedExperiment(assays, row_data, col_data)
10x4 SummarizedExperiment
  assays(2): first_assay second_assay
  rownames: bin1 bin2 ... bin9 bin10
  rowdata(5): name genus ... native order
  colnames: sample1 sample2 sample3 sample4
  coldata(1): name
  metadata(0):

Filtering

The most convenient technique to filter a SummarizedExperiment involves the map function combined with either the eachrow or eachcol iterators, depending on whether you want to filter row-wise or column-wise. First, a custom condition, such as x -> mean(x) == 10, should be passed to map together with an iterator across the dimensions of the assay of interest, such as eachcol(assay(se, "my_assay")). This returns a vector of indices that can be used to filter the rows or columns of the SummarizedExperiment, as shown here:

# drop rows with missing values in some assay
se2 = dropmissing(se, "second_assay")

# filter features that have a peak abundance of at least 10000 reads
keep_rows = map(x -> maximum(x) >= 10000, eachrow(assay(se2, "first_assay")))
se2 = se2[keep_rows, :]

# filter features that are native to the sampling site
keep_rows = map(x -> x == "true", rowdata(se2)[!, :native])
se2 = se2[keep_rows, :]

# filter samples that contain more than 15000 reads
keep_cols = map(x -> sum(x) > 15000, eachcol(assay(se2, "second_assay")))
se2 = se2[:, keep_cols]
1x1 SummarizedExperiment
  assays(2): first_assay second_assay
  rownames: bin1
  rowdata(5): name genus ... native order
  colnames: sample3
  coldata(1): name
  metadata(0):

Toolkit to manipulate DataFrames

The following commands come with DataFrames.jl (apart from the last one) and can help you prepare and shape the raw files into a SummarizedExperiment.

  • rename!
  • replace!
  • sort!
  • select!
  • leftjoin!
  • copy