CGEMS - ICR - National Cancer Institute

CGEMS to dbGaP Data Submission:

There are six items that are prepared for the data submission:

* Manifest

* Study Description

* Data File

* Data Dictionary

* Subject File

* Sample Subject Mapping

*Manifest: lists all of the files that the submitter intends to send. It serves as both an inventory and a checklist that lets dbGaP confirm that all the files were received. The study logo must be included in the submission. The following manifest template can be used:

Submitted File Name	File Type	File Description	File Size (in kb)	Comments
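The manifest can be filled in by hand, but it may be easier to generate it from the folder that holds the submission files. The following is a minimal Python sketch, assuming all files sit in one directory. The function names (build_manifest, manifest_text), the use of the file extension as "File Type", and rounding sizes up to whole kilobytes are illustrative assumptions, not dbGaP requirements.

```python
import csv
import io
import math
import os

# Column headers copied from the manifest template above.
MANIFEST_COLUMNS = [
    "Submitted File Name",
    "File Type",
    "File Description",
    "File Size (in kb)",
    "Comments",
]


def build_manifest(directory, descriptions=None):
    """Return one manifest row per regular file found in `directory`.

    `descriptions` is an optional dict mapping file name -> description
    (hypothetical helper input; descriptions cannot be derived from the files).
    """
    descriptions = descriptions or {}
    rows = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # Assumption: the file extension is used as the "File Type".
        file_type = os.path.splitext(name)[1].lstrip(".").lower() or "unknown"
        # Assumption: sizes are rounded up to whole kilobytes.
        size_kb = math.ceil(os.path.getsize(path) / 1024)
        rows.append([name, file_type, descriptions.get(name, ""), str(size_kb), ""])
    return rows


def manifest_text(rows):
    """Render the header and rows as tab-delimited text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(MANIFEST_COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()
```

Calling manifest_text(build_manifest("path/to/submission")) returns text that can be saved as the manifest file; the Comments column is left empty for the submitter to complete.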

* Study Description: describes study-related information including the following items:

Study name - character limit is 75 with spaces

Study report name - comprehensive study name

Abstract study description

Study URL

Type

Disease name(s) linked to Entrez MeSH

Inclusion/Exclusion criteria for participants (case-, control-, trio-subjects, etc.)

Study History

Relevant Publications: PubMed IDs of most recent related articles

Attribution: Title/Role of person in the study, Name, Institute of Affiliation

(Institute Name, City, State, Country)

Study Description Template

Entrez Study Name (character limit is 75 with spaces): a short study name that will appear in Entrez. The short Study Name should be relatively stable between study versions.
CGEMS Breast Cancer GWAS (Illumina 550K)
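Because the Entrez Study Name is limited to 75 characters including spaces, it can be worth checking the length before submission. A minimal Python sketch follows; the function name check_entrez_study_name is hypothetical, and the limit is taken from the template text above.

```python
ENTREZ_NAME_LIMIT = 75  # limit stated in the template, spaces included


def check_entrez_study_name(name):
    """Return the character count of `name`; raise ValueError if it is too long."""
    length = len(name)
    if length > ENTREZ_NAME_LIMIT:
        raise ValueError(
            f"Entrez Study Name is {length} characters; "
            f"the limit is {ENTREZ_NAME_LIMIT}"
        )
    return length


# The CGEMS example name is well within the limit.
print(check_entrez_study_name("CGEMS Breast Cancer GWAS (Illumina 550K)"))  # prints 40
```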

Webpage Study Name (no character limit): a comprehensive study name that will appear on the upper left hand corner of the study webpage. This name length can be longer than the Entrez Study Name. Also, this name can be different between study versions, since each study version will have a different webpage.
National Cancer Institute Cancer Genetic Markers of Susceptibility (CGEMS) Breast Cancer GWAS (Illumina 550K)

Description: an original summary description of the study. If the description is taken verbatim from a published or soon-to-be-published article, please submit copyright permission from the journal. Summaries with copyrighted material must include the following within the description: "Reprinted from [Article Citation], with permission from [Publisher]."

Cancer Genetic Markers of Susceptibility (CGEMS) Phase 1: Breast Cancer Whole Genome Association Scan is being conducted to identify genetic variants that influence susceptibility to breast cancer. Using the Illumina HumanHap550 assay, this phase screens 550,000 SNP markers from across the genome that were typed on approximately 1,140 breast cancer cases and an equivalent number of controls. The goal of this phase is to scan the genome to find genetic variants to aid in the prevention and treatment of breast cancer. Approximately 5% of the most promising variants will be carried forward to further replication and fine-mapping phases.

Study URL: the study URL(s) if applicable.

http://cgems.cancer.gov/
https://caintegrator.nci.nih.gov/cgems/

Type: the study type(s) (Longitudinal, Case-control, Case-set, Control-set, Trio, Cohort, etc).

CASE
CONTROL

Disease name(s): any number of disease name(s) associated with this study. Each disease name must be a MeSH term (http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh). To check, type in the MeSH search box: disease of interest [mh]. Disease names will be ordered as submitted.

Inclusion/Exclusion Criteria: the inclusion and exclusion criteria for cases, controls, trios, participants as applicable.

The Nurses' Health Study (NHS) is a longitudinal study of 121,700 women enrolled in 1976. The CGEMS case-control study is derived from 32,826 participants who provided a blood sample between 1989 and 1990, were free of diagnosed breast cancer at blood collection, and were followed for incident disease until May 2004. Cancer follow-up in the NHS was conducted by personal mailings and searches of the National Death Index. It is estimated that the percentage of true cancers captured by this system is greater than 90%.
Permission was requested from all participants diagnosed with cancer to review medical records to confirm the diagnoses and obtain additional information on tumor histology, staging, and other characteristics. All study participants who were menopausal at blood draw, had a confirmed diagnosis of invasive breast cancer, and had sufficient stored blood available for DNA extraction at the time of case and control selection were included as cases in the CGEMS project. Controls were matched to cases based on age, blood collection variables (time, date, and year of blood collection, as well as recent (<3 months) use of postmenopausal hormones), ethnicity (all cases and controls are self-reported Caucasians), and menopausal status (all cases and controls were menopausal at blood draw).
Informed consent was obtained from all participants. The study was approved by the Institutional Review Board of the Brigham and Women's Hospital, Boston, MA, USA.

History: the study history as applicable.

Relevant Publications: use PubMed IDs (http://www.ncbi.nlm.nih.gov/PubMed/). References will appear in the order submitted.

Study Attribution: will appear as submitted.
Header	Name	Affiliation
Principal Investigator
Institute
Funding Source

*Data File: the following phenotypes are included for CGEMS data:

Age (5 year intervals)

Case control status

Gender

Family history (+/-)

The SQL statement used to generate the phenotype data:

select PARTICIPANT_DID,AGE_AT_ENROLL_MIN||'-'||AGE_AT_ENROLL_MAX,

CASE_CONTROL_STATUS,GENDER,FAMILY_HISTORY

from STUDY_PARTICIPANT

where STUDY_ID=3
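To turn the query result into a tab-delimited phenotype file, a short script can run the statement and write the rows out. The sketch below is an assumption-laden illustration: it uses Python's built-in sqlite3 module with an in-memory table as a stand-in for the real CGEMS database, the sample rows are made up, and the AGE_RANGE alias, the ORDER BY clause, and the bound study ID parameter are additions for readability rather than part of the original statement.

```python
import csv
import io
import sqlite3

# Same query as above; AGE_RANGE alias, ORDER BY and the "?" parameter
# are illustrative additions.
PHENOTYPE_SQL = """
select PARTICIPANT_DID,
       AGE_AT_ENROLL_MIN || '-' || AGE_AT_ENROLL_MAX as AGE_RANGE,
       CASE_CONTROL_STATUS, GENDER, FAMILY_HISTORY
from STUDY_PARTICIPANT
where STUDY_ID = ?
order by PARTICIPANT_DID
"""


def export_phenotypes(conn, study_id):
    """Run the phenotype query and return the result as tab-delimited text."""
    cur = conn.execute(PHENOTYPE_SQL, (study_id,))
    header = [col[0] for col in cur.description]
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(cur)
    return buf.getvalue()


# Stand-in database with made-up rows (not real CGEMS data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table STUDY_PARTICIPANT ("
    "PARTICIPANT_DID integer, STUDY_ID integer, "
    "AGE_AT_ENROLL_MIN integer, AGE_AT_ENROLL_MAX integer, "
    "CASE_CONTROL_STATUS text, GENDER text, FAMILY_HISTORY text)"
)
conn.executemany(
    "insert into STUDY_PARTICIPANT values (?, ?, ?, ?, ?, ?, ?)",
    [
        (1001, 3, 55, 59, "CASE", "F", "+"),
        (1002, 3, 60, 64, "CONTROL", "F", "-"),
        (2001, 1, 50, 54, "CASE", "F", "-"),  # other study, filtered out
    ],
)
print(export_phenotypes(conn, 3))
```

Against the production database the connection would come from the appropriate driver instead of sqlite3, and the parameter placeholder syntax may differ.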

*Data Dictionary file: describes the variables that are included in the phenotype data file, for example:

VARNAME	VARDESC	DOCFILE	TYPE	UNITS	COMMENT	VALUES
PARTICIPANT_DID	Deidentified Participant's ID	cgems_data_phenotypes.xls (the data file name)	integer		Deidentified ID

*Subject File: for CGEMS data, the following attributes are collected and saved as a .txt file:

Subject ID (SUBJID)

Consent group (CONSENT)

Subject source* (SUBJ_SOURCE)

Source SUBJID* (SOURCE_SUBJID)

The SQL statement used to generate the subject data file:

select PARTICIPANT_DID as Subject_ID, 'NCI IRB NHS'

Consent_Group,PARTICIPANT_DID as Subject_Source,

CASE_CONTROL_STATUS

from STUDY_PARTICIPANT

where study_id=3

*Subject Data Dictionary: describes the variables that are included in the subject data file. The template is as follows:

VARNAME	VARDESC	TYPE
SUBJID	Subject ID	integer
CONSENT	Consent group as determined by DAC	encoded value
SUBJ_SOURCE	Source repository where subjects originate	string
SOURCE_SUBJID	Subject ID used in the Source Repository	integer
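Since the data dictionary is meant to describe every variable in its data file, a quick cross-check of the two files can catch columns that were left undocumented. The following Python sketch assumes both files are tab-delimited with a header row and that the dictionary has a VARNAME column, as in the template above; the function name undocumented_columns and the sample subject row are hypothetical.

```python
import csv
import io

# Dictionary text copied from the template above.
SUBJECT_DICTIONARY = (
    "VARNAME\tVARDESC\tTYPE\n"
    "SUBJID\tSubject ID\tinteger\n"
    "CONSENT\tConsent group as determined by DAC\tencoded value\n"
    "SUBJ_SOURCE\tSource repository where subjects originate\tstring\n"
    "SOURCE_SUBJID\tSubject ID used in the Source Repository\tinteger\n"
)


def undocumented_columns(data_text, dictionary_text):
    """Return data-file header columns that have no VARNAME entry in the dictionary."""
    data_header = next(csv.reader(io.StringIO(data_text), delimiter="\t"))
    dictionary = csv.DictReader(io.StringIO(dictionary_text), delimiter="\t")
    documented = {row["VARNAME"] for row in dictionary}
    return [col for col in data_header if col not in documented]


# Made-up subject file with one column missing from the dictionary.
subject_file = (
    "SUBJID\tCONSENT\tSUBJ_SOURCE\tSOURCE_SUBJID\tCASE_CONTROL_STATUS\n"
    "1001\t1\tNHS\t1001\tCASE\n"
)
print(undocumented_columns(subject_file, SUBJECT_DICTIONARY))
```

An empty list means every column in the data file has a dictionary entry.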

*Sample Subject Mapping: lists sample IDs of each individual DNA sample for which genotype information has been submitted and the corresponding individual subject ID for which phenotype data has been submitted. The following columns are included:

Specimen id

Subject id

The SQL statement used to generate the sample subject mapping data file:

select SPECIMEN_ID,PARTICIPANT_DID

from specimen

where PARTICIPANT_DID

in ( select PARTICIPANT_DID

from STUDY_PARTICIPANT

where study_id=3)
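Because the mapping file links genotyped samples to subjects with phenotype data, it may help to verify the two files against each other before submission. The Python sketch below is one possible sanity check, not a dbGaP requirement: the function name check_mapping, the three checks it performs, and the sample IDs are all illustrative assumptions.

```python
def check_mapping(mapping_rows, subject_ids):
    """Compare (specimen_id, subject_id) pairs against the known subject IDs.

    Returns a dict listing duplicated specimen IDs, subject IDs that appear in
    the mapping but not in the subject file, and subjects with no sample.
    """
    subject_ids = set(subject_ids)
    seen_specimens = set()
    duplicates = set()
    unknown_subjects = set()
    mapped_subjects = set()
    for specimen_id, subject_id in mapping_rows:
        if specimen_id in seen_specimens:
            duplicates.add(specimen_id)
        seen_specimens.add(specimen_id)
        if subject_id in subject_ids:
            mapped_subjects.add(subject_id)
        else:
            unknown_subjects.add(subject_id)
    return {
        "duplicate_specimens": sorted(duplicates),
        "unknown_subjects": sorted(unknown_subjects),
        "subjects_without_sample": sorted(subject_ids - mapped_subjects),
    }


# Made-up IDs for illustration only.
mapping = [("S1", 1001), ("S2", 1002), ("S2", 1002), ("S3", 9999)]
subjects = [1001, 1002, 1003]
print(check_mapping(mapping, subjects))
```

Note that a subject may legitimately have more than one sample, so only repeated specimen IDs are flagged, not repeated subject IDs.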
