============================================================================================================
START readme
============================================================================================================

The readme file is a document that accompanies the dataset. It provides an explanation of the dataset, 
making the dataset understandable and reusable. 

============================================================================================================
## Title
============================================================================================================

MicrobiomeMilkMap 2021-2022: Data underlying the publication "Seasonal and geographical impact on the Irish 
raw milk microbiota correlates with chemical composition and climatic variables"

============================================================================================================
## Methods and Materials
============================================================================================================

Season and location have previously been shown to be associated with differences in the microbiota of raw milk, 
especially in milk from pasture-based systems. Here we further advance research in this area by examining 
differences in the raw milk microbiota from several locations across Ireland over 12 months, and by investigating 
microbiota associations with climatic variables and chemical composition. Shotgun metagenomic sequencing was used 
to investigate the microbiota of raw milk collected from 9 locations (n=241). Concurrent chemical analysis of the 
protein, fat, lactose, total solids, nonprotein nitrogen (NPN) contents and titratable acidity (TA) of the same 
raw milk was performed. 

- Sample collection and preparation: Raw bovine milk samples (200 ml) were collected from silos from 9 locations 
across Ireland weekly from March 2021 to March 2022 (n=241). The samples were collected over 2 days, transported under 
refrigeration and stored at 4 degrees C, to mimic conditions of their storage in bulk tanks or silos, for a maximum of 
48h before sample processing of all samples together. Samples were prepared as follows: 30 ml of the bovine milk 
sample was centrifuged at 4,500 x g for 20 min at 4 degrees C. After centrifugation, the cream and supernatant were 
discarded, and the pellets were subjected to two washing steps, whereby the pellets were resuspended in sterile
PBS and centrifuged at 13,000 x g for 1 minute, after which the supernatant was discarded, and the pellet was 
stored at -20 degrees C before DNA extraction. 

- DNA extraction: Samples were subjected to DNA extraction using the MolYsis complete5 kit (Molzym GmbH & Co. KG, 
Bremen, Germany), with 50 microlitres of DNA eluted for downstream sequencing. The MolYsis kit was used to improve
microbiota characterization by significantly enhancing the microbial sequencing depth of milk samples. gDNA was 
quantified using the Qubit dsDNA HS assay kit (Invitrogen) and stored at -20 degrees C before library preparation.

- Shotgun metagenomic sequencing: 248 samples (241 samples and 7 controls) were prepared for shotgun metagenomic 
sequencing according to Illumina Nextera XT library preparation kit guidelines, using unique dual indexes for 
multiplexing with the Nextera XT index kit (Illumina). Following indexing and clean-up, samples were pooled to an 
equimolar concentration of 1 nM. Samples were sequenced in two pools, the first pool containing 98 samples on an 
Illumina NextSeq 550 sequencing platform with a V2 kit, and the second containing 150 samples on an Illumina NextSeq 
2000 sequencing platform with a P3 chip, at the Teagasc DNA Sequencing Facility, using standard Illumina sequencing 
protocols. 

- Bioinformatic processing: Default parameters were applied for all the bioinformatic tools unless otherwise specified. 
Quality checks and adapter trimming were performed with FastQC (0.11.8) and cutadapt (2.6) and host reads were aligned 
to the bovine genome (Bos taurus) and removed with Bowtie2 (2.4.4). Taxonomic classification was performed with Kraken2 
(2.0.7) (32) using the Genome Taxonomy Database (release 89) which contains Bacteria and Archaea. SUPER-FOCUS was used 
to predict the microbiological functional potential of shotgun reads, through the alignment of reads against a reduced 
SEED database using DIAMOND, with results classified into subsystems (sets of protein families with similar function). 
Resistome analysis was done using Resistance Gene Identifier (RGI 4.2.2), with the strict cut-off. Assembly of Metagenome 
Assembled Genomes (MAGs) was done using metaSPAdes (3.13), followed by binning with MetaBAT2 (2.12.1) and quality 
assessment with checkM (1.0.18). High-quality MAGs, of at least 90% completeness and less than 5% contamination, were 
assigned taxonomy with GTDB-tk (2.1.1).
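
The high-quality MAG criteria above can be illustrated with a minimal Python sketch. This is not the script used 
in the study: the BinQuality record, the bin names and the example values are hypothetical, and only the two 
thresholds (at least 90% completeness, less than 5% contamination) come from the text.

```python
from typing import List, NamedTuple

# Thresholds taken from the text: at least 90% completeness, less than 5% contamination.
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


class BinQuality(NamedTuple):
    """Quality summary for one genome bin (hypothetical record layout)."""
    name: str             # bin identifier (hypothetical naming)
    completeness: float   # percent
    contamination: float  # percent


def is_high_quality(completeness: float, contamination: float) -> bool:
    """Return True when a bin meets both thresholds."""
    return completeness >= MIN_COMPLETENESS and contamination < MAX_CONTAMINATION


def filter_high_quality(bins: List[BinQuality]) -> List[BinQuality]:
    """Keep only the bins that qualify as high-quality MAGs."""
    return [b for b in bins if is_high_quality(b.completeness, b.contamination)]


# Made-up values chosen to exercise both boundaries.
example_bins = [
    BinQuality("bin.1", 97.3, 1.2),
    BinQuality("bin.2", 90.0, 4.9),   # exactly 90% completeness is kept
    BinQuality("bin.3", 95.0, 5.0),   # exactly 5% contamination is dropped
    BinQuality("bin.4", 89.9, 0.0),
]
high_quality = filter_high_quality(example_bins)
print([b.name for b in high_quality])  # ['bin.1', 'bin.2']
```

In practice the completeness and contamination values would be read from the quality assessment output rather 
than typed in by hand.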

- Chemical analysis: The chemical composition of 100 ml of each milk sample was determined by DPTC analytical 
staff at the Technical Services lab at the Teagasc Food Research Centre. Kjeldahl analysis was used to determine 
protein and nonprotein nitrogen (NPN) contents. The Rose-Gottlieb method was used to determine fat content, and the 
CEM SMART Trac II (CEM, Matthews, NC, USA) was used to measure the total solids content. Polarimetry was used to 
determine the lactose content, and titration was used to determine titratable acidity (TA) in raw milk samples.

- Climatic data: Monthly climate data for the sampling locations relating to mean temperature (degrees C), total rainfall 
(mm), grass minimum temperature (degrees C), mean wind speed (knots) and sunshine duration (daily hours of sun) was retrieved 
from the Irish Meteorological Service website (www.met.ie). The months of March, April and May were classified as 
Spring; June, July and August as Summer; September, October and November as Autumn; and December, January and 
February as Winter.
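
The season classification above can be written as a simple lookup. The following Python sketch is illustrative 
only: the function names are assumptions and are not taken from the analysis scripts.

```python
from datetime import date

# Month number -> season, following the classification described in the text.
SEASON_BY_MONTH = {
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
    12: "Winter", 1: "Winter", 2: "Winter",
}


def season_for_month(month: int) -> str:
    """Return the season for a calendar month number (1-12)."""
    if month not in SEASON_BY_MONTH:
        raise ValueError(f"month must be between 1 and 12, got {month!r}")
    return SEASON_BY_MONTH[month]


def season_for_date(sampling_date: date) -> str:
    """Return the season for a sampling date."""
    return season_for_month(sampling_date.month)


print(season_for_date(date(2021, 3, 15)))  # Spring
print(season_for_date(date(2022, 1, 10)))  # Winter
```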

- Statistical analysis: Statistical analysis and data visualization were performed in R (4.1.2). All data was cleaned, 
analyzed and visualised in R with ggplot2, tidyverse and ggpubr packages (44, 45). Kruskal-Wallis and pairwise 
Wilcoxon rank sum tests with Benjamini-Hochberg P-value correction were used to compare sampling seasons and locations. 
Microbiota diversity analysis was performed with the vegan package (46), and beta diversity was calculated as Bray-Curtis 
metrics, visualised in a principal coordinate analysis plot. The adonis function from the vegan package was used to 
calculate the permutational analysis of variance (PERMANOVA) to determine differences in composition of the community 
between groups of samples (number of permutations=999). Redundancy analysis was also done with vegan and visualised 
using the ggord package. The multipatt function from the indicspecies package was used to identify taxa that were 
significantly associated with particular seasons and sampling locations, by calculating Pearson's phi coefficient of 
association and correcting for unequal group sizes using the parameter r.g. Pearson's correlation was measured 
with the R base function, cor, and visualised using ggcorrplot.
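
For readers unfamiliar with the Bray-Curtis measure used for beta diversity, the sketch below shows the 
underlying calculation for two samples in plain Python. It is an illustration only: the study computed this in R 
with the vegan package, and the function name and example abundances here are hypothetical.

```python
from typing import Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    Computed as sum(|u_i - v_i|) / sum(u_i + v_i): 0 means identical
    composition, 1 means the samples share no taxa.
    """
    if len(u) != len(v):
        raise ValueError("both samples must cover the same taxa")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("dissimilarity is undefined when both samples are empty")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Made-up abundances for three taxa in two samples.
sample_a = [10, 0, 5]
sample_b = [5, 5, 5]
print(round(bray_curtis(sample_a, sample_b), 3))  # 0.333
```

A full analysis would compute this value for every pair of samples to build the distance matrix used for the 
principal coordinate analysis and PERMANOVA.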

============================================================================================================
## Software, and Code
============================================================================================================

Software Required: 

+ R 4.1.2
+ RStudio 2022.07.1 Build 554
+ FastQC 0.11.8
+ cutadapt 2.6
+ Bowtie2 2.4.4
+ Kraken2 2.0.7 (Genome Taxonomy Database, release 89)
+ SUPER-FOCUS 0.34 (db90 - clustered SEED database (2019))
+ diamond 0.9.24.125
+ Resistance Gene Identifier (RGI) 4.2.2
+ metaSPAdes 3.13
+ MetaBAT2 2.12.1
+ checkM 1.0.18
+ GTDB-tk 2.1.1

Source code:

Available at: https://github.com/aforestsomewhere/MilkMicrobiomeMap

============================================================================================================
## FileFormats
============================================================================================================

.csv
.sh

============================================================================================================
## CodeBook
============================================================================================================

Please view codebook.csv (where this readme file is also found) for documentation of abbreviations, 
column names, datapoints, etc. The file codebook.csv uses the columns:

+ index = a number used to distinguish the different entries.
+ used = location where the code is used. (filename, foldername, columnname, datapoint, protocol, etc.).
+ code = the abbreviation, variable / data / column name used.
+ meaning = the literal meaning of the code. (e.g., fully written out abbreviation)
+ represents = what the code represents in terms of data or usage. (e.g., units of measurement, 
coding used, more in-depth explanation)
+ missing values = value used to indicate missing values.

Note that this file is ';' delimited. To avoid possible confusion and inconsistencies, sentences within 
cells do not contain punctuation marks such as commas or semicolons. When required, sections of a sentence 
are separated using the hashtag symbol (#). Example: The sex of an animal is described as
"m = male pig (boar) # f = female pig (sow)" where the hashtag separates the elements in a sentence.

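As a minimal sketch of how the codebook layout described above could be read, the Python example below parses 
';' delimited text and splits the "represents" column on the hashtag symbol. The column names come from the list 
above. The sample row is hypothetical (it reuses the example sentence) and the function name is an assumption.

```python
import csv
import io
from typing import Dict, List, TextIO


def read_codebook(handle: TextIO) -> List[Dict[str, object]]:
    """Read ';' delimited codebook rows and split 'represents' on '#'."""
    rows: List[Dict[str, object]] = []
    for raw in csv.DictReader(handle, delimiter=";"):
        row: Dict[str, object] = dict(raw)
        represents = raw.get("represents") or ""
        # '#' separates the elements of a sentence within a single cell.
        row["represents_parts"] = [
            part.strip() for part in represents.split("#") if part.strip()
        ]
        rows.append(row)
    return rows


# Hypothetical codebook content, for illustration only.
SAMPLE = (
    "index;used;code;meaning;represents;missing values\n"
    "1;columnname;sex;sex of the animal;"
    "m = male pig (boar) # f = female pig (sow);NA\n"
)

rows = read_codebook(io.StringIO(SAMPLE))
print(rows[0]["code"])              # sex
print(rows[0]["represents_parts"])  # ['m = male pig (boar)', 'f = female pig (sow)']
```

To read the real file, an opened file handle for codebook.csv would replace the in-memory sample.
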
============================================================================================================
## Other
============================================================================================================

Please view the associated metadata and codebook file within the parent folder of the data package (where 
you found this readme file).

Not all parts of this dataset were published, as it contains sensitive data from farmers. Access to the full data can be 
requested by contacting the corresponding author of the linked publication.

Scripts used in the analysis of this dataset are archived in a private repository 
(https://github.com/aforestsomewhere/MilkMicrobiomeMap) and will be made available upon reasonable request. 

============================================================================================================
END readme
============================================================================================================
