BRU: Data Formatting Requirements - Do’s and Don’ts




Please make sure to REMOVE all patient identifiers (name, initials, DOB, address, MRN, OHIP no.,etc.) before sending data to BRU.


General Format:


  • DO: Data should have the format of a table, where each row corresponds to an observation unit (e.g. subject; event; etc)* and each column to a variable (e.g. age, gender, etc.)
  • DO: Comma-separated values (.CSV) file type is preferred, however Excel files (.xls; .xlsx) are acceptable.




  • DO: The first row of the data should contain the variable (column) name.
  • DO: Names should be kept short.
  • DON’T: Names should not start with a numeric character.
  • DON’T: Names should not contain any commas or empty spaces.
  • DON’T use the same name for multiple columns.
  • DON’T have multiple header rows.
  • DON’T leave empty rows after the header row (actual data values should start on row 2)


Unique ID:


  • DO: Observation units (e.g. patients, studies etc.) should be uniquely identified by an ID key (a number or a number-letter combination).
  • DO: It is preferred that this unique ID should be the first column of the data table. *


Variable (Column) Types:


  • Numerical
    • DO: Numeric variables should only contain numbers.
    • DO: For a value of zero, the number 0 should be used, as opposed to leaving the cell empty.
    • DON’T: They should not contain other character type values (e.g. “missing”, “unknown”, “one”, “less than ten” etc.).



  • Categorical (including Binary)
    • DO: A coding legend (e.g. 0 = No; 1 = Yes) needs to be provided if numbers are used. **
    • DON’T use different notations for the same value (e.g. F and f and female and Female) in the same variable.
    • DON’T colour code variables.


Examples of categorical variables with appropriate coding and legend:

Variable Name:




Blood type


(Yes / No)

(male / female)

(Yes / No)

(A, B, AB, 0)


0 / 1


0 / 1


Coding legend:

(0 = No; 1=Yes)

(M=male; F=female)

(0 = No; 1=Yes)


  • Date
    • DO: If using Excel, make sure the cells are formatted as “Date”.
    • DO: To avoid any confusion, please choose the format with month spelled out in words (e.g. use “March 11, 2010”, not “11/03/10”)

Missing Values:


  • DO: Missing values should only be coded as NA (preferred) or the cell left blank.
  • DO: Use only one of the above 2 options throughout the whole dataset (either NA or blank cell).
  • DON’T: No other values (numeric or character) should be used for indicating missing values (e.g. 9, 999, -1, “missing” etc.).
  • DO: create variables that help identify whether missing data is unknown or not applicable (e.g. ‘name_drug’ missing due to unknown name or due to drugs not being taken by the subject? Solution: add a Yes/No “drugs_taken” variable).


Data dictionary:


  • DO: Every dataset should be accompanied by a separate document (or a tab in the spreadsheet) that says:
    • what each variable name means and
    • what any numeric coding refers to (see “Coding Legends” in the examples of categorical variables above)
  • DON’T use the column number (e.g. column H) to refer to a variable (always use its name).


Additional Tips:


  • DON’T include summary statistics in the data (e.g. means, medians in the rows below the actual data; or graphs on the same sheet as the data)
  • DON’T group the data with blank or header rows or columns. Indicate groups by a column (e.g “Group” with coding 1/2/3 that is defined in the Data Dictionary)


Longitudinal Data (Repeated Measures):


  • DO: Use multiple rows per observation unit (e.g. patient), one for each measurement (i.e. “long” format).
  • DO: Each row should be uniquely identified by one or more variables (e.g. Subject_ID and Measurement_No and/or Measurement_Time).
  • DO: Use a separate data table (one row per unit) for baseline variables.


* Please see the Longitudinal Data section for additional information
** Please see the Data Dictionary section for additional information