CGEMS to dbGaP Data Submission:
There are six items that are prepared for the data submission:
* Manifest
* Study Description
* Data File
* Data Dictionary
* Subject File
* Sample Subject Mapping
* Manifest: lists all of the files that the submitter intends to send. It serves both as an inventory and as a checklist for dbGaP to ensure that all files are received. The study logo must be included in the submission. The following manifest template can be used:
Submitted File Name | File Type | File Description | File Size (in kb) | Comments |
| | | | |
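A manifest in this shape can be drafted from the submission folder rather than typed by hand. The following is a minimal sketch, assuming all files to be sent sit in one directory and that a tab-delimited text layout is acceptable; the function names, the extension-based file type guess, and the 1 KB = 1024 bytes rounding are illustrative assumptions, not dbGaP requirements.

```python
import csv
import io
import math
from pathlib import Path

# Column names taken from the manifest template above.
MANIFEST_COLUMNS = [
    "Submitted File Name",
    "File Type",
    "File Description",
    "File Size (in kb)",
    "Comments",
]


def build_manifest(directory, descriptions=None):
    """Return one manifest row (a dict) per regular file in `directory`."""
    descriptions = descriptions or {}
    rows = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip sub-directories
        rows.append({
            "Submitted File Name": path.name,
            # Assumption: the file type is guessed from the extension.
            "File Type": path.suffix.lstrip(".").lower() or "unknown",
            "File Description": descriptions.get(path.name, ""),
            # Assumption: size is rounded up to whole KB of 1024 bytes.
            "File Size (in kb)": math.ceil(path.stat().st_size / 1024),
            "Comments": "",
        })
    return rows


def manifest_to_text(rows):
    """Render manifest rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=MANIFEST_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Descriptions and comments still need to be filled in by the submitter; the sketch only saves retyping file names and sizes.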
* Study Description: describes study-related information including the following items:
- Study name - character limit is 75 with spaces
- Study report name - comprehensive study name
- Abstract study description
- Disease name(s) linked to Entrez MeSH
- Inclusion/Exclusion criteria for participants (case-, control-, trio-subjects, etc.)
- Relevant Publications: PubMed IDs of most recent related articles
- Attribution: title/role of the person in the study, name, and institute of affiliation (Institute Name, City, State, Country)
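The 75-character limit on the short study name counts spaces, so it is easy to overshoot. A minimal sketch of a pre-submission check follows; the function and constant names are hypothetical, and only the limit itself comes from the list above.

```python
ENTREZ_NAME_LIMIT = 75  # character limit stated above, spaces included


def check_entrez_study_name(name):
    """Return (ok, length) for a proposed short study name."""
    length = len(name)  # len() counts spaces as characters
    return length <= ENTREZ_NAME_LIMIT, length


# Example with the short name used for the CGEMS breast cancer study.
ok, length = check_entrez_study_name("CGEMS Breast Cancer GWAS (Illumina 550K)")
print(ok, length)  # prints: True 40
```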
Study Description Template
Entrez Study Name (character limit is 75 with spaces): a short study name that will appear in Entrez. The short Study Name should be relatively stable between study versions.
CGEMS Breast Cancer GWAS (Illumina 550K) |
|
Webpage Study Name (no character limit): a comprehensive study name that will appear on the upper left hand corner of the study webpage. This name length can be longer than the Entrez Study Name. Also, this name can be different between study versions, since each study version will have a different webpage.
National Cancer Institute Cancer Genetic Markers of Susceptibility (CGEMS) Breast Cancer GWAS (Illumina 550K) |
|
Description: an original summary description of the study. If the description is taken verbatim from a published or soon-to-be-published article, please submit copyright permission from the journal. Summaries with copyrighted material must include the following within the description: "Reprinted from [Article Citation], with permission from [Publisher]." |
The Cancer Genetic Markers of Susceptibility (CGEMS) Phase 1 Breast Cancer Whole Genome Association Scan is being conducted to identify genetic variants that influence susceptibility to breast cancer. Using the Illumina HumanHap550 assay, this phase screens 550,000 SNP markers from across the genome that were typed on approximately 1,140 breast cancer cases and an equivalent number of controls. The goal of this phase is to scan the genome to find genetic variants to aid in the prevention and treatment of breast cancer. Approximately 5% of the most promising variants will be carried forward to further replication and fine-mapping phases. |
Type: the study type(s) (Longitudinal, Case-control, Case-set, Control-set, Trio, Cohort, etc). |
CASE
CONTROL |
Disease name(s): any number of disease names associated with this study. Each disease name must be a MeSH term (http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh). To check, type in the MeSH search box: disease of interest [mh]. Disease names will be ordered as submitted. |
|
Inclusion/Exclusion Criteria: the inclusion and exclusion criteria for cases, controls, trios, participants as applicable. |
The Nurses' Health Study (NHS) is a longitudinal study of 121,700 women enrolled in 1976. The CGEMS case-control study is derived from 32,826 participants who provided a blood sample between 1989 and 1990, were free of diagnosed breast cancer at blood collection, and were followed for incident disease until May 2004. Cancer follow-up in the NHS was conducted by personal mailings and searches of the National Death Index. It is estimated that the percentage of true cancers captured by this system is greater than 90%.
Permission was requested from all participants diagnosed with cancer to review medical records to confirm the diagnoses and obtain additional information on tumor histology, staging, and other characteristics. All study participants who were menopausal at blood draw, had a confirmed diagnosis of invasive breast cancer, and had sufficient stored blood available for DNA extraction at the time of case and control selection were included as cases in the CGEMS project. Controls were matched to cases based on age, blood collection variables (time, date, and year of blood collection, as well as recent (<3 months) use of postmenopausal hormones), ethnicity (all cases and controls are self-reported Caucasians), and menopausal status (all cases and controls were menopausal at blood draw).
Informed consent was obtained from all participants. The study was approved by the Institutional Review Board of the Brigham and Women's Hospital, Boston, MA, USA. |
History: the study history as applicable. |
|
Study Attribution: will appear as submitted. |
Header | Name | Affiliation |
Principal Investigator | | |
Institute | | |
Funding Source | | |
* Data File: the phenotype data file for CGEMS includes the de-identified participant ID, age range at enrollment, case/control status, gender, and family history.
The SQL statement to generate the phenotype data:
-- Phenotype data for the CGEMS participants (STUDY_ID = 3)
select PARTICIPANT_DID,
       AGE_AT_ENROLL_MIN || '-' || AGE_AT_ENROLL_MAX,
       CASE_CONTROL_STATUS,
       GENDER,
       FAMILY_HISTORY
from STUDY_PARTICIPANT
where STUDY_ID = 3
* Data Dictionary: describes the variables included in the phenotype data file. For example:
VARNAME | VARDESC | DOCFILE | TYPE | UNITS | COMMENT | VALUES |
PARTICIPANT_DID | Deidentified Participant's ID | cgems_data_phenotypes.xls (the data file name) | integer | | Deidentified ID | |
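Because the data dictionary describes the variables in the data file, it is worth confirming before submission that every column in the data file has a dictionary row. The following is a minimal sketch, assuming both files have been exported as tab-delimited text with a header line; the function name is hypothetical, and only the VARNAME column name comes from the template above.

```python
import csv
import io


def undocumented_columns(data_text, dictionary_text, delimiter="\t"):
    """Return data-file columns that have no VARNAME row in the dictionary."""
    # The first line of the data file is assumed to hold the column names.
    data_header = next(csv.reader(io.StringIO(data_text), delimiter=delimiter), [])
    documented = {
        row["VARNAME"]
        for row in csv.DictReader(io.StringIO(dictionary_text), delimiter=delimiter)
    }
    # Keep the data file's column order so the report is easy to follow.
    return [column for column in data_header if column not in documented]
```

An empty result means every column is documented; any names returned need a dictionary entry (or removal from the data file) before the files are sent.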
* Subject File: for CGEMS data, the following attributes are collected and saved as a txt file:
- Subject source* (SUBJ_SOURCE)
- Source SUBJID* (SOURCE_SUBJID)
The SQL statement to generate the subject data file:
select PARTICIPANT_DID as Subject_ID,
       'NCI IRB NHS' as Consent_Group,
       PARTICIPANT_DID as Subject_Source,
       CASE_CONTROL_STATUS
from STUDY_PARTICIPANT
where STUDY_ID = 3
* Subject Data Dictionary: describes the variables included in the subject data file. The template is as follows:
VARNAME | VARDESC | TYPE |
SUBJID | Subject ID | integer |
CONSENT | Consent group as determined by DAC | encoded value |
SUBJ_SOURCE | Source repository where subjects originate | string |
SOURCE_SUBJID | Subject ID used in the Source Repository | integer |
* Sample Subject Mapping: lists the sample ID of each individual DNA sample for which genotype information has been submitted and the corresponding individual subject ID for which phenotype data has been submitted. The following columns are included: sample ID (SPECIMEN_ID) and subject ID (PARTICIPANT_DID).
The SQL statement to generate the sample subject mapping data file:
select SPECIMEN_ID, PARTICIPANT_DID
from SPECIMEN
where PARTICIPANT_DID in (
    select PARTICIPANT_DID
    from STUDY_PARTICIPANT
    where STUDY_ID = 3
)
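Since the mapping ties each genotyped sample to a subject with submitted phenotype data, a quick consistency check against the subject file can catch mismatches early. The following is a minimal sketch, assuming the mapping has been loaded as (sample ID, subject ID) pairs and the subject IDs as a plain list; the function name and the two checks are illustrative assumptions, not a documented dbGaP validation.

```python
def check_sample_subject_mapping(mapping, subject_ids):
    """Report mapping problems.

    mapping: iterable of (sample_id, subject_id) pairs.
    subject_ids: iterable of subject IDs present in the subject file.
    """
    known_subjects = set(subject_ids)
    first_subject_for_sample = {}
    conflicting_samples = set()
    unknown_subjects = set()
    for sample_id, subject_id in mapping:
        # A sample should not be assigned to two different subjects.
        previous = first_subject_for_sample.setdefault(sample_id, subject_id)
        if previous != subject_id:
            conflicting_samples.add(sample_id)
        # Every mapped subject should also appear in the subject file.
        if subject_id not in known_subjects:
            unknown_subjects.add(subject_id)
    return {
        "conflicting_samples": sorted(conflicting_samples),
        "unknown_subjects": sorted(unknown_subjects),
    }
```

Both lists being empty indicates that the mapping and the subject file agree under these two assumed rules; one subject with several samples is allowed.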