============================================================================================================
START readme
============================================================================================================

The readme file is a document that accompanies the dataset. It provides an explanation of the dataset, 
making the dataset understandable and reusable. 



============================================================================================================
## Title
============================================================================================================

Strain-specific adaptations and community changes of milk and water kefir in response to environmental parameters (Kefir4All Citizen Science Project)

============================================================================================================
## Methods and Materials
============================================================================================================


Description of the dataset - 
448 short read sequencing based fastq files 
48 long read sequencing fastq files
80 metabolic profiles

Household survey consisting of 100 responses 
Fermentation Survey consisting of 1823 responses 
Workshop survey consisting of 87 responses
Evaluation survey consisting of 18 responses
Evaluation survey (teachers) consisting of 6 responses

Methods - 


Prior to DNA extraction, 15 ml of kefir liquid was centrifuged at 5,444 × g for 30 minutes (min) at 4°C. Milk kefir liquid samples were washed with 10 ml of phosphate buffered saline (PBS) and centrifuged again at 5,444 × g for 15 min to remove protein and fat residues. Two additional rounds of washing and centrifugation were performed before pelleting the microbial cells. Water kefir liquid samples were washed with 1 ml of PBS and centrifuged again at 5,444 × g for 1 min. The cell pellet was resuspended in 800 µl of Solution CD1 from the DNeasy PowerSoil Pro Kit. Total DNA from the resuspended pellets was extracted and purified according to the standard DNeasy PowerSoil Pro Kit protocol.
Total DNA was also extracted from each kefir grain. Fragments of 100 mg were removed from each of the grains and added to separate PowerBead tubes (Cambio, Cambridge, United Kingdom). The grain fragments were homogenized by shaking the PowerBead tubes on the TissueLyser II (Qiagen, West Sussex, United Kingdom) at 20 Hz for 10 min. Following homogenization, DNA was purified from the sample by the method outlined above. Total DNA was initially quantified using the Qubit High Sensitivity DNA assay (BioSciences, Dublin, Ireland). DNA was then stored at -20°C. Sequencing libraries were prepared using the Illumina DNA Prep (M) Tagmentation kit (Illumina) following the manufacturer’s guidelines. Libraries were sequenced for 300 bp paired-end reads using the Illumina NovaSeq 6000 platform.


Our study included five surveys: “Workshop Survey”, “Getting started workshop”, “Fermentation survey”, “Project completion survey – Citizen Scientists” and “Project completion survey – Teachers”, all constructed using SurveyMonkey [24]. Students and members of the public were asked to complete the “Workshop Survey” at the end of our recruitment workshop; its responses therefore include opinions from members of the public who did not subsequently take part in the study. The “Workshop Survey” examined the educational effect on students of attending a 1 hour workshop. It also assessed the pre-existing microbiology- and fermentation-related knowledge of the participating citizen scientists. The “Fermentation survey” was used to collect metadata about each fermentation process completed by the citizen scientists. Both the “Project completion survey – Citizen Scientists” and the “Project completion survey – Teachers” primarily examined the educational effect on citizen scientists of taking part in the project, together with the personal accounts of the citizen scientists and teachers, respectively, on the strengths of the project and possible improvements. Citizen scientists were asked to complete the “Fermentation survey” throughout the project and the “Project completion survey – Citizen Scientists” at the end of the project. Coordinators of the Kefir4All project were asked to complete the “Project completion survey – Teachers” at the end of the project. All survey responses were analysed qualitatively.

Citizen scientists were asked to complete a survey at the start of the project (https://www.surveymonkey.com/r/KefirLocation) to record information about their household environment, and then a separate survey for each individual fermentation (https://www.surveymonkey.com/r/KefirFermentation) to provide metadata concerning the fermentation process. The citizen scientists were asked to provide both kefir grain and liquid samples after the first week of fermentation and every four weeks thereafter, for up to 21 weeks.
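
As an illustration of the sampling schedule described above, the following minimal Python sketch lists the expected sampling weeks. It assumes that "every four weeks thereafter" is counted from the week 1 sample and that week 21 is included; the function name and its arguments are hypothetical and not part of the project materials.

```python
def sampling_weeks(first_week=1, interval=4, last_week=21):
    """Return the weeks at which grain and liquid samples were due.

    Assumption: the four-week interval is counted from the first sample,
    and the final week (21) is included if it falls on the schedule.
    """
    return list(range(first_week, last_week + 1, interval))


if __name__ == "__main__":
    print(sampling_weeks())
```

Under these assumptions the schedule gives six sampling points per participant: weeks 1, 5, 9, 13, 17 and 21.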




============================================================================================================
## Software and Code
============================================================================================================

Software Required:

SurveyMonkey and Illumina NovaSeq sequencing platform. 

Source code:

Initially, raw paired-end FASTQ files containing the metagenomic shotgun sequences were trimmed using Trim Galore (v0.6.1) to remove adapter content and reads that were of low quality (average quality score <Q20), fragmented (<75 bp) or contained more than two ambiguous nucleotides. To identify contaminating DNA (i.e., from the substrate, host or sequencing run controls), milk kefir metagenomes were queried against a reference database containing bovine, phiX 174 and human genomes using Bowtie2 (v2.3.4) with the parameter '--sensitive-local'. Water kefir metagenomes were queried against a reference database containing phiX 174 and human (GRCh38) genomes using Bowtie2 (v2.3.4) with the parameter '--sensitive-local'. Reference genomes were downloaded from iGenomes (https://support.illumina.com/sequencing/sequencing_software/igenome.html) and represented the most up-to-date versions available in NCBI (November 2022). Remaining high-quality reads were sorted and split to create forward, reverse and unpaired read output files for each metagenome.
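
The three read-level criteria stated above can be sketched in Python as follows. This is only an illustration of the thresholds as written in this readme; it does not reproduce Trim Galore's own trimming logic, and the function names, the Phred+33 quality encoding and the use of "N" as the ambiguous base are assumptions.

```python
def mean_phred(quality, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 assumed)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(sequence, quality, min_mean_q=20, min_length=75, max_ambiguous=2):
    """Apply the three read-level criteria stated in the text.

    A read is kept only if it is at least 75 bp long, contains no more
    than two ambiguous bases (counted here as 'N') and has a mean
    quality of at least Q20.
    """
    if len(sequence) < min_length:
        return False  # fragmented read (<75 bp)
    if sequence.upper().count("N") > max_ambiguous:
        return False  # more than two ambiguous nucleotides
    if mean_phred(quality) < min_mean_q:
        return False  # average quality below Q20
    return True


if __name__ == "__main__":
    # Hypothetical 80 bp read with uniformly high quality ('I' = Q40).
    print(keep_read("A" * 80, "I" * 80))
```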
Compositional analysis was carried out with MetaCache (v2.3), and functional profiling was performed with SUPER-FOCUS (v0.34) using the DB_90 database and DIAMOND for alignment. Strain-level profiling was performed with StrainPhlAn4 using the parameters --mutation_rates --marker_in_n_samples 1 --sample_with_n_markers 10 --phylophlan_mode accurate. Normalised phylogenetic distances between strains of the same species were calculated using the tree_pairwisedists.py script in PyPhlAn (https://github.com/SegataLab/pyphlan). Metagenomic assembly was performed using SPAdes (v3.15.3) on individual metagenomes and merged metagenomes to recover individual sample assemblies and co-assemblies, respectively. Several SPAdes (v3.15.3) commands were run: the --meta parameter was used for general assembly, the --metaviral parameter for viral assembly and the --metaplasmid [23] parameter for plasmid assembly. MetaWRAP (v1.3.2) was used for genome binning with default settings. CheckM2 (v0.1.3) and BUSCO (v5.1.3) were used to check the quality of the bacterial and eukaryotic metagenome-assembled genomes (MAGs), respectively. Low-quality MAGs, i.e., those with <50% completeness and/or >5% contamination, were removed from downstream analysis. Taxator-tk (v1.5.0), implemented in the classify_bins module of MetaWRAP, and GTDB-Tk [28] were used to assign taxonomy to the MAGs. GTDB-Tk was used to further identify putative new species based on average nucleotide identity (ANI) values <95%. dRep (v3.2.0) was used to cluster MAGs representing putative new species into primary and secondary clusters on the basis of their relative similarities. ANI between all strains was calculated using dRep (v3.2.0). Same strains were determined based on a protocol outlined in Feehily et al. Briefly, MAGs with ANI > 99.9% and genome coverage > 90% were deemed to be the same strain.
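
The MAG quality cut-offs and the same-strain rule described above can be expressed as two simple predicates. The Python sketch below is a hedged illustration only: the function names and the dictionary keys ("name", "completeness", "contamination") are hypothetical and do not correspond to CheckM2, BUSCO or dRep output columns, and the example values are invented for demonstration.

```python
def is_quality_mag(completeness, contamination):
    """Keep MAGs with >=50% completeness and <=5% contamination."""
    return completeness >= 50.0 and contamination <= 5.0


def filter_mags(mags):
    """Drop low-quality MAGs from a list of dicts (hypothetical keys)."""
    return [
        m for m in mags
        if is_quality_mag(m["completeness"], m["contamination"])
    ]


def same_strain(ani, coverage):
    """Same strain if ANI > 99.9% and genome coverage > 90% (both strict)."""
    return ani > 99.9 and coverage > 90.0


if __name__ == "__main__":
    # Invented example records, not values from the dataset.
    example = [
        {"name": "bin.1", "completeness": 97.2, "contamination": 1.1},
        {"name": "bin.2", "completeness": 42.0, "contamination": 0.5},
        {"name": "bin.3", "completeness": 88.0, "contamination": 7.3},
    ]
    print([m["name"] for m in filter_mags(example)])
    print(same_strain(ani=99.95, coverage=93.0))
```

Keeping the thresholds in small named functions makes the "and/or" in the exclusion rule explicit: a MAG is retained only when it passes both the completeness and the contamination check.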
MAGs considered in this study were annotated with DRAM (Distilled and Refined Annotation of Metabolism) [30]. Guppy (v6.3.8) was used to basecall the downloaded fast5 data and convert them into standard fastq files. Each long-read fastq file was processed with Porechop v0.2.4 (https://github.com/rrwick/Porechop) to trim off sequencing adapters. Trimmed reads were aligned against the same reference database as the short reads using minimap2 (v0.2.4) with the parameters. Long reads that mapped to the reference database were considered to be contaminants and set aside. The remaining unmapped reads were considered microbial reads. They were extracted from the SAM file using samtools with the command “samtools view -f 4”.
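
To show what the "-f 4" filter selects, the following Python sketch applies the same test to SAM text lines: a record is kept when bit 0x4 ("segment unmapped") is set in its FLAG field. It is an illustration of the flag logic only, not a replacement for samtools; the function name and the example records are hypothetical.

```python
UNMAPPED = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_records(sam_lines):
    """Yield SAM alignment lines whose FLAG has the 0x4 bit set."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header lines are not alignment records
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record (11 mandatory fields)
        if int(fields[1]) & UNMAPPED:
            yield line


if __name__ == "__main__":
    # Invented records: one mapped (FLAG 0) and one unmapped (FLAG 4).
    records = [
        "@HD\tVN:1.6\tSO:unsorted",
        "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    ]
    for record in unmapped_records(records):
        print(record.split("\t")[0])
```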
Metagenomic hybrid assembly was carried out with OPERA-MS (v0.8.2). For all tools outlined above, default parameters were used unless specified otherwise. Analysis of the resulting tables from each tool was completed in R (v4.1.2).


============================================================================================================
## FileFormats
============================================================================================================



============================================================================================================
## CodeBook
============================================================================================================
milk kefir, water kefir, metagenomics, lactic acid bacteria, identification, sequencing

============================================================================================================
## Other
============================================================================================================
Licenses/restrictions placed on the data: 
The data are licensed for reuse under a Creative Commons Attribution-NonCommercial 4.0 licence, CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/).

# DISCLAIMER 
Teagasc makes no representations or warranties regarding the accuracy, completeness, or fitness for purpose of the information contained in this
dataset and accepts no liability for any loss or damage arising from its use or reuse.

# COPYRIGHT
© 2026 Teagasc (Agriculture and Food Development Authority of Ireland)

============================================================================================================
END readme
============================================================================================================
