Tutorial 5: Wrangle real data into a SummarizedExperiment

Until now, experiments were either simulated, imported from R or converted from a CommunityProfile. Producing instances of SummarizedExperiment from scratch might appear more challenging, as it requires some data analysis skills to wrangle the raw data into the necessary shape. Also, filtering missing values or setting some thresholds could also come into handy. Here, a few possible approaches are explained.

Assays

It is recommended to store experimental results and bioinformatics output into an Excel workbook or as a group of csv files, which can be easily imported.

using XLSX, CSV, DataFrames

# read assays stored in same excel and store them into separate variables
first_assay = XLSX.readdata(joinpath(@__DIR__, "assets/RE_assays.xlsx"), "Sheet1", "B2:E11") |> Matrix
second_assay = XLSX.readdata(joinpath(@__DIR__, "assets/RE_assays.xlsx"), "Sheet1", "F2:I11") |> Matrix

# replace NA values to Julia-native "missing" object class
replace!(first_assay, "NA" => missing)
replace!(second_assay, "NA" => missing)

# assemble assays into a single object
assays = OrderedDict{String, AbstractArray}("first_assay" => first_assay,
                                            "second_assay" => second_assay)

OrderedDict{String, AbstractArray} with 2 entries:
  "first_assay"  => Any[10482.2 1441.07 6448.97 7737.69; 9401.0 7211.86 4721.28…
  "second_assay" => Any[10787.3 12275.2 16580.6 2682.84; missing missing missin…

Rowdata and Coldata

After loading the csv file containing the rowdata, a few operations are needed to make it suitable for feeding into a SummarizedExperiment. For more analytical tips, the docs on DataFrames.jl and this toolkit are recommended.

# load rowdata from csv file
row_data = CSV.File(joinpath(@__DIR__, "assets/RE_rowdata.csv")) |> DataFrame

# rename "Column1" to "name"
rename!(row_data, "Column1" => :name)

# sort by feature number so rowdata order matches assay order
row_data[!, :order] = parse.(Int64, replace.(row_data[!, :name], "bin" => ""))
sort!(row_data, :order)

10×5 DataFrame

Row	name	genus	species	native	order
	String7	String31	String	String7	Int64
1	bin1	g__MGIIa-L2	s__MGIIa-L2 sp013911465	true	1
2	bin2	g__UBA3478	s__UBA3478 sp011525015	false	2
3	bin3	g__GCA-2684655	s__	true	3
4	bin4	g__	s__	nothing	4
5	bin5	g__Shewanella	s__	true	5
6	bin6	g__Sphingomonas	s__Sphingomonas ginsenosidimutans	true	6
7	bin7	g__Flavobacterium	s__Flavobacterium sp002862805	true	7
8	bin8	g__Vibrio	s__Vibrio azureus	false	8
9	bin9	g__Polaribacter	s__	nothing	9
10	bin10	g__Alteromonas	s__Alteromonas sp009811415	false	10

The same method can be applied to the coldata. If it is not available or is not relevant for the analysis, it can be substituted with an empty DataFrame containing only the sample names as done below. Then, we gather the components into a SummarizedExperiment.

# create empty coldata object
col_data = DataFrame(
    name = ["sample$i" for i in 1:4]
);

# assemble into SummarizedExperiment
se = SummarizedExperiment(assays, row_data, col_data)

10x4 SummarizedExperiment
  assays(2): first_assay second_assay
  rownames: bin1 bin2 ... bin9 bin10
  rowdata(5): name genus ... native order
  colnames: sample1 sample2 sample3 sample4
  coldata(1): name
  metadata(0):

Filtering

The most convenient technique to filter a SummarizedExperiment involves the map function combined with either the eachrow or eachcol iterators, depending on whether you want to filter row-wise or column-wise. First, a custom condition, such as x -> mean(x) == 10, should be passed to map together with an iterator across the dimensions of the assay of interest, such as eachcol(assay(se, "my_assay")). This returns a vector of indices that can be used to filter the rows or columns of the SummarizedExperiment, as shown here:

# drop rows with missing values in some assay
se2 = dropmissing(se, "second_assay")

# filter features that have a peak abundance of at least 10000 reads
keep_rows = map(x -> maximum(x) >= 10000, eachrow(assay(se2, "first_assay")))
se2 = se2[keep_rows, :]

# filter features that are native to the sampling site
keep_rows = map(x -> x == "true", rowdata(se2)[!, :native])
se2 = se2[keep_rows, :]

# filter samples that contain more than 15000 reads
keep_cols = map(x -> sum(x) > 15000, eachcol(assay(se2, "second_assay")))
se2 = se2[:, keep_cols]

1x1 SummarizedExperiment
  assays(2): first_assay second_assay
  rownames: bin1
  rowdata(5): name genus ... native order
  colnames: sample3
  coldata(1): name
  metadata(0):

Toolkit to manipulate DataFrames

The following commands come with DataFrames.jl (apart from the last one) and can help you prepare and shape the raw files into a SummarizedExperiment.

rename!
replace!
sort!
select!
leftjoin!
copy