Configuration files
To use proms for marker selection and model construction, two configuration files are needed.
Run configuration file
A run configuration file describes the run settings and hyperparameters. It is a YAML file with the following schema:
Key | Description |
---|---|
repeat | Number of cross validation repeats |
k | Number of selected markers |
estimators | One model will be trained for each estimator with selected markers as features |
percentile | Percent of features to keep in the filtering step; the algorithm will determine the best “percentile” to use |
n_jobs | Maximum number of concurrently running workers |
Currently, the following estimators are supported (use the name in parentheses in the configuration file):
For classification tasks:
- logistic regression (lr)
- support vector classifier (svm)
- eXtreme Gradient Boosting (xgboost)
- random forest (rf)
For regression tasks:
- ridge regression (ridge)
- support vector regressor (svm)
- eXtreme Gradient Boosting (xgboost)
- random forest (rf)
For survival analysis tasks:
- Cox proportional hazards model (coxph)
An example run configuration file is shown below:
---
repeat: 2
k:
- 5
- 10
- 15
estimators:
- lr
- rf
- svm
percentile:
- 5
- 10
- 15
n_jobs: 20
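The schema and estimator names above can be checked before starting a run. The following is a minimal sketch, not part of proms itself: the function name `check_run_config`, the task labels `classification`, `regression`, and `survival`, and the choice to treat all five keys as required are assumptions made for illustration. It works on the dictionary that a YAML parser (for example PyYAML's `yaml.safe_load`) would return for the file.

```python
# Estimator names per task, copied from the list in this section.
# The task labels used as dictionary keys are hypothetical.
SUPPORTED_ESTIMATORS = {
    "classification": {"lr", "svm", "xgboost", "rf"},
    "regression": {"ridge", "svm", "xgboost", "rf"},
    "survival": {"coxph"},
}

# Keys from the schema table; treating all of them as required is an assumption.
RUN_KEYS = ("repeat", "k", "estimators", "percentile", "n_jobs")


def check_run_config(cfg, task):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = [f"missing key: {key}" for key in RUN_KEYS if key not in cfg]
    allowed = SUPPORTED_ESTIMATORS[task]
    for name in cfg.get("estimators", []):
        if name not in allowed:
            problems.append(f"estimator '{name}' is not listed for {task}")
    return problems


# The example run configuration, written as the dict a YAML parser would produce.
run_cfg = {
    "repeat": 2,
    "k": [5, 10, 15],
    "estimators": ["lr", "rf", "svm"],
    "percentile": [5, 10, 15],
    "n_jobs": 20,
}

print(check_run_config(run_cfg, "classification"))
print(check_run_config(run_cfg, "regression"))
```

For the classification task the example passes without problems; for the regression task the check flags `lr`, because logistic regression is listed only under classification.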
Data configuration file
A data configuration file describes the input data. It is a YAML file with the following schema:
Key | Description |
---|---|
project_name | Project short name |
data_directory | Path to the data directory. It can be an absolute path or a directory name relative to the data configuration file. |
train_data_directory | Path to the train data directory. It is relative to data_directory. |
test_data_directory | (optional) Path to the independent test data directory. It is relative to data_directory. |
target_view | View from which the markers should be selected (see available views in the data/train/view section below) |
target_label | Column name of the attribute to be predicted |
data | Information about the train and (optional) test data sets |
data/train | Information about the train data set |
data/train/label/file | Name of the file containing train labels |
data/train/view | A list of training data views. Each view consists of two items: type and file name |
data/test | (optional) Information about the test data set |
data/test/label/file | (optional) Name of the file containing test labels |
data/test/view | (optional) A list of test data views. Each view consists of two items: type and file name |
A sample data configuration file (crc.yml) is shown below:
---
project_name: crc
data_directory: crc_data
train_data_directory: train_data
test_data_directory: test_data
target_view: pro
target_label: msi
data:
  train:
    label:
      file: clinical_data_train.tsv
    view:
    - type: mrna
      file: Colon_rna_fpkm.tsv
    - type: pro
      file: Colon_pro_spc.tsv
  test:
    label:
      file: clinical_data_test.tsv
    view:
    - type: pro
      file: Colon_pro_spc_2.tsv
The corresponding directory structure is:
.
├── crc_data
│   ├── test_data
│   │   ├── Colon_pro_spc_2.tsv
│   │   └── clinical_data_test.tsv
│   └── train_data
│       ├── Colon_pro_spc.tsv
│       ├── Colon_rna_fpkm.tsv
│       └── clinical_data_train.tsv
└── crc.yml
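The path rules from the schema (data_directory may be absolute or relative to the data configuration file; the train and test directories are relative to data_directory) can be illustrated with a short sketch. This is not proms's own code: the helper `resolve_data_dirs` and the location `/work/crc.yml` are hypothetical, and the configuration is passed as an already-parsed dictionary.

```python
from pathlib import PurePosixPath


def resolve_data_dirs(config_path, cfg):
    """Resolve the directories named in a data configuration dict."""
    config_dir = PurePosixPath(config_path).parent
    # Joining with an absolute path returns that absolute path unchanged,
    # so both absolute and relative data_directory values are handled.
    data_dir = config_dir / cfg["data_directory"]
    dirs = {
        "data": data_dir,
        "train": data_dir / cfg["train_data_directory"],
    }
    # test_data_directory is optional in the schema.
    if "test_data_directory" in cfg:
        dirs["test"] = data_dir / cfg["test_data_directory"]
    return dirs


# Directory keys from the sample crc.yml; /work is a hypothetical location.
cfg = {
    "data_directory": "crc_data",
    "train_data_directory": "train_data",
    "test_data_directory": "test_data",
}
for name, path in resolve_data_dirs("/work/crc.yml", cfg).items():
    print(name, path)
```

With the sample configuration stored at `/work/crc.yml`, the train directory resolves to `/work/crc_data/train_data` and the test directory to `/work/crc_data/test_data`, matching the directory structure shown above.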