Configuration files¶

To use proms for marker selection and model construction, two configuration files are needed.

Run configuration file¶

A run configuration file describes the settings for the run and hyperparameters. It is a yaml file with the following schema:

Key	Description
repeat	Number of cross validation repeats
k	Number of selected markers
estimators	One model will be trained for each estimator with selected markers as features
percentile	Percent of features to keep in the filtering step, the algorithm will determine the best “percentile” to be used
n_jobs	Maximum number of concurrently running workers

Currently, the following estimators are supported (name in the parentheses should be used in the configuration file)

For classification task:

logistic regression (lr),
support vector classifier (svm),
eXtreme Gradient Boosting (xgboost),
random forest (rf)

For regression task:

ridge regression (ridge)
support vector regressor (svm),
eXtreme Gradient Boosting (xgboost),
random forest (rf)

For survival analysi task:

cox proportional hazards model (coxph)

An example run configuration file is shown below:

---
repeat: 2
k:
- 5
- 10
- 15
estimators:
- lr
- rf
- svm
percentile:
- 5
- 10
- 15
n_jobs: 20

Data configuration file¶

A data configuration file describes the input data. It is a yaml file with the following schema:

Key	Description
project_name	project short name
data_directory	path to the data directory. It can be an absolute path or a directory name relative to the data configuration file.
train_data_directory	path to the train data directory. It is relative to data_directory.
test_data_directory	(optional) path to the independent test data directory. It is relative to data_directory.
target_view	from which view should the markers be selected (see available views in the data/train/view section below)
target_label	column name of attribute to be predicted
data	information about train and (optional) test data set
data/train	information about train data set
data/train/label/file	name of the file containing train labels
data/train/label/view	a list of training data views. Each view consists of two items: type and file name
data/test	(optional) information about test data set
data/test/label/file	(optional) name of the file containing test labels
data/test/label/view	(optional) a list of test data views. Each view consists of two items: type and file name

A sample data configuration file (crc.yml) is shown below:

---
project_name: crc
data_directory: crc_data
train_data_directory: train_data
test_data_directory: test_data
target_view: pro
target_label: msi
data:
  train:
    label:
      file: clinical_data_train.tsv
    view:
    - type: mrna
      file: Colon_rna_fpkm.tsv
    - type: pro
      file: Colon_pro_spc.tsv
  test:
    label:
      file: clinical_data_test.tsv
    view:
    - type: pro
      file: Colon_pro_spc_2.tsv

The corresponding directory structure is:

.
├── crc_data
│   ├── test_data
│   │   ├── Colon_pro_spc_2.tsv
│   │   └── clinical_data_train.tsv
│   └── train_data
│       ├── Colon_pro_spc.tsv
│       ├── Colon_rna_fpkm.tsv
│       └── clinical_data_test.tsv
└── crc.yml