Important:
Please make sure to REMOVE all patient identifiers (name, initials, DOB, address, MRN, OHIP no.,etc.) before sending data to BRU.
General Format:
- DO: Data should have the format of a table, where each row corresponds to an observation unit (e.g. subject; event; etc)* and each column to a variable (e.g. age, gender, etc.)
- DO: Comma-separated values (.CSV) file type is preferred, however Excel files (.xls; .xlsx) are acceptable.
Headers:
- DO: The first row of the data should contain the variable (column) name.
- DO: Names should be kept short.
- DON’T: Names should not start with a numeric character.
- DON’T: Names should not contain any commas or empty spaces.
- DON’T use the same name for multiple columns.
- DON’T have multiple header rows.
- DON’T leave empty rows after the header row (actual data values should start on row 2)
Unique ID:
- DO: Observation units (e.g. patients, studies etc.) should be uniquely identified by an ID key (a number or a number-letter combination).
- DO: It is preferred that this unique ID should be the first column of the data table. *
Variable (Column) Types:
- Numerical
- DO: Numeric variables should only contain numbers.
- DO: For a value of zero, the number 0 should be used, as opposed to leaving the cell empty.
- DON’T: They should not contain other character type values (e.g. “missing”, “unknown”, “one”, “less than ten” etc.).
- Categorical (including Binary)
- DO: A coding legend (e.g. 0 = No; 1 = Yes) needs to be provided if numbers are used. **
- DON’T use different notations for the same value (e.g. F and f and female and Female) in the same variable.
- DON’T colour code variables.
Examples of categorical variables with appropriate coding and legend:
Variable Name: |
Smoking |
Sex |
Male |
Blood type |
Values: |
(Yes / No) |
(male / female) |
(Yes / No) |
(A, B, AB, 0) |
Coding: |
0 / 1 |
M/F |
0 / 1 |
A/B/AB/O |
Coding legend: |
(0 = No; 1=Yes) |
(M=male; F=female) |
(0 = No; 1=Yes) |
- Date
- DO: If using Excel, make sure the cells are formatted as “Date”.
- DO: To avoid any confusion, please choose the format with month spelled out in words (e.g. use “March 11, 2010”, not “11/03/10”)
Missing Values:
- DO: Missing values should only be coded as NA (preferred) or the cell left blank.
- DO: Use only one of the above 2 options throughout the whole dataset (either NA or blank cell).
- DON’T: No other values (numeric or character) should be used for indicating missing values (e.g. 9, 999, -1, “missing” etc.).
- DO: create variables that help identify whether missing data is unknown or not applicable (e.g. ‘name_drug’ missing due to unknown name or due to drugs not being taken by the subject? Solution: add a Yes/No “drugs_taken” variable).
Data dictionary:
- DO: Every dataset should be accompanied by a separate document (or a tab in the spreadsheet) that says:
- what each variable name means and
- what any numeric coding refers to (see “Coding Legends” in the examples of categorical variables above)
- DON’T use the column number (e.g. column H) to refer to a variable (always use its name).
Additional Tips:
- DON’T include summary statistics in the data (e.g. means, medians in the rows below the actual data; or graphs on the same sheet as the data)
- DON’T group the data with blank or header rows or columns. Indicate groups by a column (e.g “Group” with coding 1/2/3 that is defined in the Data Dictionary)
Longitudinal Data (Repeated Measures):
- DO: Use multiple rows per observation unit (e.g. patient), one for each measurement (i.e. “long” format).
- DO: Each row should be uniquely identified by one or more variables (e.g. Subject_ID and Measurement_No and/or Measurement_Time).
- DO: Use a separate data table (one row per unit) for baseline variables.
* Please see the Longitudinal Data section for additional information
** Please see the Data Dictionary section for additional information
File dowload