Usage¶
Info¶
As a rule, four files are provided in each methylation dataset:
betas.txt - contains the methylation data itself. Rows correspond to individual CpGs, and columns correspond to subjects.
annotations.txt - contains information about CpGs. Rows correspond to individual CpGs, and columns correspond to CpG characteristics (gene, bop, coordinate, etc.).
observables.txt - contains information about subjects. Rows correspond to subjects, and columns correspond to subject observables (age, gender, disease, etc.).
cells.txt - contains information about cell type populations. For example, if DNA methylation profiles are taken from human whole blood, each patient may have a different proportion of blood cell types. Rows correspond to subjects, and columns correspond to the proportions of different cell types.
The first line in each file is usually a header. File names and file extensions may differ between datasets, but the content is the same.
Currently, only the .txt extension is supported.
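As a rough illustration of this layout, the sketch below writes a tiny betas.txt and reads it back with the standard csv module. The tab delimiter, the ID_REF header cell, the subject labels and the helper name read_table are assumptions made for the example; the delimiter is not specified here, and this is not pydnameth's own loader.

```python
import csv
import os
import tempfile


def read_table(file_path, delimiter='\t'):
    """Read a header line plus rows keyed by the first column (hypothetical helper)."""
    with open(file_path, newline='') as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        header = next(reader)  # the first line is treated as the header
        rows = {line[0]: line[1:] for line in reader if line}
    return header, rows


# Build a toy betas.txt: rows are CpGs, columns are subjects (made-up values).
with tempfile.TemporaryDirectory() as dataset_dir:
    betas_path = os.path.join(dataset_dir, 'betas.txt')
    with open(betas_path, 'w', newline='') as handle:
        handle.write('ID_REF\tsubject_1\tsubject_2\n')
        handle.write('cg13982318\t0.81\t0.79\n')
        handle.write('cg11868595\t0.12\t0.15\n')
    header, betas = read_table(betas_path)

subjects = header[1:]  # every header cell after the first names a subject
```

The same helper would apply to observables.txt and cells.txt, where rows are subjects instead of CpGs.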
These files must be located in the same directory. After running experiments, new directories with results and cached data files with .pkl and .npz extensions will appear in this directory.
For all experiments provided by pydnameth you need to specify config information.
Config¶
For each experiment you need to create instances of the following classes:
pdm.Data
pdm.Annotations
pdm.Attributes
Data¶
pdm.Data contains information about the dataset.
To create an instance of pdm.Data, specify the following fields:
name¶
Name of the file without extension (currently only the .txt extension is supported) that contains the methylation data.
Example:
name = 'betas'
base¶
Name of the directory where the necessary files are located and where the result files will be saved.
Example:
base = 'GSE87571'
Example¶
Example of creating a pdm.Data instance:
data = pdm.Data(
    name='betas',
    path='C:/Data',
    base='GSE40279'
)
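The example suggests that the dataset directory is path joined with base, and that name maps to a .txt file inside it. That composition is an inference from the field descriptions, not a documented guarantee, and dataset_file and missing_files below are hypothetical helpers rather than part of pydnameth. A minimal sketch for checking that the expected files are in place before running an experiment:

```python
import os
import tempfile


def dataset_file(path, base, name, extension='.txt'):
    """Compose <path>/<base>/<name><extension> (assumed layout)."""
    return os.path.join(path, base, name + extension)


def missing_files(path, base, names):
    """Return the names whose files are absent from the dataset directory."""
    return [n for n in names if not os.path.isfile(dataset_file(path, base, n))]


# Demonstration on a temporary directory that holds only betas.txt.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'GSE40279'))
    open(dataset_file(root, 'GSE40279', 'betas'), 'w').close()
    absent = missing_files(
        root, 'GSE40279', ['betas', 'annotations', 'observables', 'cells']
    )
```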
Annotations¶
pdm.Annotations allows you to define a subset of CpGs that will be considered in the experiment.
To create an instance of pdm.Annotations, specify the following fields:
name¶
Name of the file without extension (currently only the .txt extension is supported) that contains information about CpGs.
Example:
name = 'annotations'
exclude¶
Name of the file without extension (currently only the .txt extension is supported) that contains the CpGs to exclude.
If it equals 'none', no CpGs are excluded.
Example:
exclude = 'none'
cross_reactive¶
Should cross-reactive CpGs be considered in the experiment?
Currently supported options (string):
'ex' - all cross-reactive CpGs are excluded.
'any' - all CpGs are considered.
Example:
cross_reactive = 'ex'
snp¶
Should SNP CpGs be considered in the experiment?
Currently supported options (string):
'ex' - all SNP CpGs are excluded.
'any' - all CpGs are considered.
Example:
snp = 'ex'
chr¶
Which chromosomes are considered in the experiment?
Currently supported options (string):
'NS' - only CpGs on non-sex chromosomes are considered.
'X' - only CpGs on the X chromosome are considered.
'Y' - only CpGs on the Y chromosome are considered.
'any' - all CpGs are considered.
Example:
chr = 'NS'
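To make the four chr options concrete, here is a small sketch of the selection rule they describe. It assumes chromosomes are labelled with strings such as '1', '17', 'X' and 'Y'; the function passes_chr and the toy CpG ids are invented for illustration and do not reproduce pydnameth's internal code.

```python
def passes_chr(chromosome, option):
    """Decide whether a CpG on `chromosome` is kept for the given chr option."""
    if option == 'any':
        return True
    if option == 'NS':
        # Non-sex chromosomes: everything except X and Y.
        return chromosome not in ('X', 'Y')
    if option in ('X', 'Y'):
        return chromosome == option
    raise ValueError('Unsupported chr option: {!r}'.format(option))


# Toy annotation records: CpG id -> chromosome label (made-up mapping).
cpg_to_chr = {'cg_a': '1', 'cg_b': 'X', 'cg_c': 'Y', 'cg_d': '17'}
kept = sorted(cpg for cpg, chrom in cpg_to_chr.items() if passes_chr(chrom, 'NS'))
```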
gene_region¶
Should CpGs that are mapped to genes be considered?
Currently supported options (string):
'yes' - only CpGs that are mapped to genes are considered.
'no' - only CpGs that are not mapped to genes are considered.
'any' - all CpGs are considered.
Example:
gene_region = 'yes'
geo¶
CpGs of which geo-types should be considered?
Currently supported options (string):
'shores' - only CpGs on shores are considered.
'shores_s' - only CpGs on southern shores are considered.
'shores_n' - only CpGs on northern shores are considered.
'islands' - only CpGs on islands are considered.
'islands_shores' - only CpGs on islands or shores are considered.
'any' - all CpGs are considered.
Example:
geo = 'any'
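The geo options can be read as sets of allowed region labels. The sketch below assumes the annotation file labels regions in the Illumina manifest style ('Island', 'N_Shore', 'S_Shore', and others such as 'N_Shelf'); those labels, the GEO_OPTIONS table and passes_geo are assumptions for illustration, not pydnameth's documented behavior.

```python
# Assumed mapping from geo option to the region labels it admits.
GEO_OPTIONS = {
    'shores': {'N_Shore', 'S_Shore'},
    'shores_s': {'S_Shore'},
    'shores_n': {'N_Shore'},
    'islands': {'Island'},
    'islands_shores': {'Island', 'N_Shore', 'S_Shore'},
}


def passes_geo(geo_label, option):
    """Decide whether a CpG with region label `geo_label` is kept."""
    if option == 'any':
        return True
    return geo_label in GEO_OPTIONS[option]


# Toy records: CpG id -> region label (made-up mapping).
cpg_to_geo = {'cg_a': 'Island', 'cg_b': 'N_Shore', 'cg_c': 'S_Shore', 'cg_d': 'N_Shelf'}
on_shores = sorted(c for c, g in cpg_to_geo.items() if passes_geo(g, 'shores'))
```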
probe_class¶
Which CpG probe classes should be considered?
Currently supported options (string):
'A' - class A CpGs are considered.
'B' - class B CpGs are considered.
'C' - class C CpGs are considered.
'D' - class D CpGs are considered.
'A_B' - class A and B CpGs are considered.
'any' - all CpGs are considered.
Example:
probe_class = 'any'
Example¶
Example of creating a pdm.Annotations instance:
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
Attributes¶
pdm.Attributes describes information about subjects.
To create an instance of pdm.Attributes, specify the following fields:
observables¶
Specifies the observables of the subjects under consideration. Should be a pdm.Observables instance.
To create a pdm.Observables instance, specify:
name - name of the file without extension (currently only the .txt extension is supported) that contains information about subjects.
Example:
name = 'observables'
types - Python dict with key - header of the target observable and value - value of the target observable. A value can also be 'any' if you want to consider all existing values.
Example:
{'gender': 'F'}
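The types dict can be read as a filter over subjects. The following sketch shows one plausible interpretation, in which every key must match unless its value is 'any', and an empty dict imposes no constraint. The function matches and the toy subject records are hypothetical, and the treatment of the empty dict is an assumption.

```python
def matches(subject, types):
    """True if the subject satisfies every key/value pair in `types`."""
    return all(
        value == 'any' or subject.get(key) == value
        for key, value in types.items()
    )


# Toy rows of an observables file, one dict per subject (made-up values).
subjects = [
    {'id': 's1', 'gender': 'F', 'age': '34'},
    {'id': 's2', 'gender': 'M', 'age': '51'},
    {'id': 's3', 'gender': 'F', 'age': '67'},
]
females = [s['id'] for s in subjects if matches(s, {'gender': 'F'})]
everyone = [s['id'] for s in subjects if matches(s, {'gender': 'any'})]
unfiltered = [s['id'] for s in subjects if matches(s, {})]
```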
cells¶
Specifies the cell type populations. Should be a pdm.Cells instance.
To create a pdm.Cells instance, specify:
name - name of the file without extension (currently only the .txt extension is supported) that contains information about cell type populations.
Example:
name = 'cells'
types - Python list of cell types that should be considered in the experiment (string headers in file_name) or the string 'any' if you want to consider all cell types.
Example:
types = ['Monocytes', 'B', 'CD4T', 'NK', 'CD8T', 'Gran']
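A short sketch of how a types value could be resolved against the header of the cells file. The function select_cell_columns is a hypothetical helper, and raising an error for unknown cell types is a design choice of this example, not documented pydnameth behavior.

```python
def select_cell_columns(header, types):
    """Resolve `types` ('any' or a list of headers) to a list of column names."""
    if types == 'any':
        return list(header)
    unknown = [t for t in types if t not in header]
    if unknown:
        raise KeyError('Cell types not found in header: {}'.format(unknown))
    return list(types)


# Header of a toy cells file (subject id column omitted for brevity).
cells_header = ['Monocytes', 'B', 'CD4T', 'NK', 'CD8T', 'Gran']
all_cells = select_cell_columns(cells_header, 'any')
lymphoid = select_cell_columns(cells_header, ['B', 'NK'])
```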
Example¶
Example of creating a pdm.Attributes instance:
observables = pdm.Observables(
    name='observables',
    types={'gender': 'F'}
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
attributes = pdm.Attributes(
    target='age',
    observables=observables,
    cells=cells
)
Released Experiments¶
The names of the functions provided by the pydnameth package follow this logic:
The first part is the data type for the experiment. For example, betas, residuals or attributes.
The second part answers the question: WHAT DO WE WANT TO DO? For example, table - table with data and characteristics processing, plot - data plotting.
The third part answers the question: HOW DO WE WANT TO DO IT? It specifies the method for the experiment. For example, linreg - linear regression method.
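The naming scheme can be illustrated by splitting a function name at its first two underscores. The helper split_experiment_name is invented for this example; treating everything after the second underscore as the method part is an assumption that happens to fit names such as betas_table_aggregator_linreg.

```python
def split_experiment_name(function_name):
    """Split '<data type>_<task>_<method>' into its three parts."""
    data_type, task, method = function_name.split('_', 2)
    return {'data_type': data_type, 'task': task, 'method': method}


simple = split_experiment_name('betas_table_linreg')
compound = split_experiment_name('betas_table_aggregator_linreg')
```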
Currently released functions:
- pydnameth.scripts.develop.betas.clock.betas_clock_special(data, annotations, attributes, file, method_params=None)¶
Produces an epigenetic clock using the best CpGs, which are provided in the input file.
The epigenetic clock is represented as a table. Each row corresponds to a clock built on all CpGs from the previous rows, including the current row.
Columns:
item: CpG id.
aux: gene to which the CpG is mapped.
R2: coefficient of determination of the linear regression between the real and predicted target observable. A statistical measure of how well the regression line approximates the data points.
r: correlation coefficient between the real and predicted target observable.
evs: explained variance regression score.
mae: mean absolute error regression loss.
rmse: root mean square error.
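For orientation, the metric columns can be computed from real and predicted values with the textbook definitions below. This is a sketch under the assumption that the standard formulas apply (with population variance for evs); the exact estimators used by pydnameth are not stated here, and regression_metrics and the toy ages are made up for the example.

```python
import math


def regression_metrics(real, predicted):
    """Compute R2, r, evs, mae and rmse from paired real/predicted values."""
    n = len(real)
    mean_real = sum(real) / n
    mean_pred = sum(predicted) / n
    residuals = [y - p for y, p in zip(real, predicted)]
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((y - mean_real) ** 2 for y in real)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cov = sum((y - mean_real) * (p - mean_pred) for y, p in zip(real, predicted))
    mean_res = sum(residuals) / n
    var_res = sum((e - mean_res) ** 2 for e in residuals) / n
    return {
        'R2': 1.0 - ss_res / ss_tot,              # coefficient of determination
        'r': cov / math.sqrt(ss_tot * ss_pred),   # Pearson correlation
        'evs': 1.0 - var_res / (ss_tot / n),      # explained variance score
        'mae': sum(abs(e) for e in residuals) / n,
        'rmse': math.sqrt(ss_res / n),
    }


# Toy example: real ages versus ages predicted by some clock.
metrics = regression_metrics([20, 40, 60, 80], [25, 35, 65, 75])
```

R2 and evs coincide whenever the residuals have zero mean, as in this toy example.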
Possible parameters of experiment:
'type': type of clocks. Possible options:
'all': iterative building of clocks starting from one element in the model, ending with 'size' elements in the model.
'single': building of clocks only with 'size' elements in the model.
'deep': iterative building of clocks starting from one element in the model, ending with 'size' elements in the model, but choosing all possible combinations from 'size' elements.
'part': the proportion of subjects that go into the test set. From 0.0 to 1.0.
'size': maximum number of exogenous variables in a model.
'runs': number of bootstrap runs in the model.
- Parameters
data – pdm.Data instance, which specifies information about dataset.
annotations – pdm.Annotations instance, which specifies subset of CpGs.
attributes – pdm.Attributes instance, which specifies information about subjects.
method_params – parameters of experiment.
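One way to read the three 'type' options is as different families of CpG subsets on which clocks are built. The sketch below encodes that reading; it is an interpretation of the option descriptions, not pydnameth's code, and clock_feature_sets and the CpG ids are hypothetical.

```python
from itertools import combinations


def clock_feature_sets(cpgs, clock_type, size):
    """List the CpG subsets a clock would be built on for each 'type' option."""
    top = list(cpgs[:size])  # only the first `size` CpGs from the input list
    if clock_type == 'all':
        # Growing prefixes: 1 CpG, 2 CpGs, ..., `size` CpGs.
        return [tuple(top[:k]) for k in range(1, len(top) + 1)]
    if clock_type == 'single':
        # One model with exactly `size` CpGs.
        return [tuple(top)]
    if clock_type == 'deep':
        # Every combination of 1..`size` CpGs out of the first `size`.
        return [c for k in range(1, len(top) + 1) for c in combinations(top, k)]
    raise ValueError('Unsupported clock type: {!r}'.format(clock_type))


best_cpgs = ['cg1', 'cg2', 'cg3', 'cg4']
all_sets = clock_feature_sets(best_cpgs, 'all', 3)
single_sets = clock_feature_sets(best_cpgs, 'single', 3)
deep_sets = clock_feature_sets(best_cpgs, 'deep', 3)
```

Under this reading the number of 'deep' models grows as 2^size - 1, so 'deep' is only practical for small 'size' values.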
- pydnameth.scripts.develop.betas.plot.betas_plot_scatter(data, annotations, attributes, observables_list, child_method=<Method.linreg: 'linreg'>, data_params=None, method_params=None)¶
Plots the dependence of methylation level on observables as a scatter plot for the provided subject subsets and the provided CpG list.
Possible parameters of experiment:
'x_range': can be 'auto' or a list with two elements, which are the borders of the target axis.
…
- Parameters
data – pdm.Data instance, which specifies information about dataset.
annotations – pdm.Annotations instance, which specifies subset of CpGs.
attributes – pdm.Attributes instance, which specifies information about subjects.
observables_list – list of subject subsets. Each element in the list is a dict, where key is the observable name and value is the possible values for this observable.
method_params – parameters of experiment.
- pydnameth.scripts.develop.betas.table.betas_table_aggregator_linreg(data, annotations, attributes, observables_list, data_params=None, method_params=None)¶
Produces a table with information about the observable-specificity of the target data type and target observable for each CpG.
Columns:
item: CpG id.
aux: gene, on which CpG is mapped.
area_intersection_rel: relative intersection area of polygons, equal to the ratio of the area of the polygons' intersection to the area of their union.
slope_intersection_rel: relative intersection area of allowed regions for slopes of linear regression.
max_abs_slope: maximal absolute slope among all provided subject subsets.
…
z_value: number of standard deviations by which the data point is above the mean value. The considered data point is the difference between two linear regression slopes.
abs_z_value: absolute value of z_value.
p_value: probability of observing a difference in slopes at least this extreme under the null hypothesis that the difference in slopes is zero.
…
For each subject subset, the following columns are added to the resulting table:
R2_***: determination coefficient. A statistical measure of how well the regression line approximates the data points.
intercept_***: estimated value of the intercept of linear regression.
slope_***: estimated value of the slope of linear regression.
intercept_std_***: standard error of the estimate of the intercept of linear regression.
slope_std_***: standard error of the estimate of the slope of linear regression.
intercept_p_value_***: p-value for the intercept of linear regression.
slope_p_pvalue_***: p-value for the slope of linear regression.
…
Where *** is the name of subjects subset.
Possible parameters of experiment:
None
- Parameters
data – pdm.Data instance, which specifies information about dataset.
annotations – pdm.Annotations instance, which specifies subset of CpGs.
attributes – pdm.Attributes instance, which specifies information about subjects.
observables_list – list of subject subsets. Each element in the list is a dict, where key is the observable name and value is the possible values for this observable.
method_params – parameters of experiment.
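The z_value, abs_z_value and p_value columns describe a z-test on the difference between two regression slopes. A common form of that test, assuming the two subject subsets are independent and the statistic is approximately standard normal, is sketched below. Whether pydnameth uses exactly this formula is an assumption; slope_difference_z_test and the numbers are illustrative only.

```python
import math


def slope_difference_z_test(slope_1, slope_std_1, slope_2, slope_std_2):
    """Two-sided z-test for the difference between two independent slopes."""
    # Standard error of the difference of two independent estimates.
    std_diff = math.sqrt(slope_std_1 ** 2 + slope_std_2 ** 2)
    z_value = (slope_1 - slope_2) / std_diff
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z_value) / math.sqrt(2.0))
    return z_value, abs(z_value), p_value


# Made-up slopes and standard errors for two subject subsets (e.g. F and M).
z_value, abs_z_value, p_value = slope_difference_z_test(0.5, 0.03, 0.4, 0.04)
```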
- pydnameth.scripts.develop.betas.table.betas_table_linreg(data, annotations, attributes, method_params=None, data_params=None)¶
Produces a table with linear regression results between the target observable and methylation level (beta values) for each CpG.
Each row corresponds to a specific CpG.
Columns:
item: CpG id.
aux: gene, on which CpG is mapped.
R2: determination coefficient. A statistical measure of how well the regression line approximates the data points.
intercept: estimated value of the intercept of linear regression.
slope: estimated value of the slope of linear regression.
intercept_std: standard error of the estimate of the intercept of linear regression.
slope_std: standard error of the estimate of the slope of linear regression.
intercept_p_value: p-value for the intercept of linear regression.
slope_p_pvalue: p-value for the slope of linear regression.
…
Possible parameters of experiment:
None
- Parameters
data – pdm.Data instance, which specifies information about dataset.
annotations – pdm.Annotations instance, which specifies subset of CpGs.
attributes – pdm.Attributes instance, which specifies information about subjects.
method_params – parameters of experiment.
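The regression columns listed for this table correspond to ordinary least squares with a single predictor. The sketch below computes them for one CpG with the textbook formulas; it assumes the target observable (here age) is the predictor and the beta value is the response, and it omits the p-value columns because they need a t-distribution that the standard library does not provide. The function simple_linreg and the data are made up.

```python
import math


def simple_linreg(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    sigma2 = ss_res / (n - 2)  # residual variance with n - 2 degrees of freedom
    return {
        'R2': 1.0 - ss_res / ss_tot,
        'intercept': intercept,
        'slope': slope,
        'intercept_std': math.sqrt(sigma2 * (1.0 / n + mean_x ** 2 / s_xx)),
        'slope_std': math.sqrt(sigma2 / s_xx),
    }


# Toy data for one CpG: subject ages and the corresponding beta values.
ages = [20, 40, 60, 80]
cpg_betas = [0.30, 0.41, 0.49, 0.60]
row = simple_linreg(ages, cpg_betas)
```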
- pydnameth.scripts.develop.observables.plot.observables_plot_histogram(data, annotations, attributes, observables_list, method_params=None)¶
Plots a histogram of the target observable distribution for the provided subject subsets and the provided CpG list.
Possible parameters of experiment:
'bin_size': bin size for a numeric target. It is ignored for a categorical target.
'opacity': opacity level. From 0.0 to 1.0.
'barmode': type of barmode. Possible options:
'overlay' for overlaid histograms.
'stack' for stacked histograms.
'x_range': can be 'auto' or a list with two elements, which are the borders of the target axis.
- Parameters
data – pdm.Data instance, which specifies information about dataset.
annotations – pdm.Annotations instance, which specifies subset of CpGs.
attributes – pdm.Attributes instance, which specifies information about subjects.
cpg_list – List of CpGs for plotting
observables_list – list of subject subsets. Each element in the list is a dict, where key is the observable name and value is the possible values for this observable.
method_params – parameters of experiment.
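To show what 'bin_size' controls for a numeric target, the sketch below groups values into fixed-width bins labelled by their left edge. It only illustrates the binning idea; the plotting itself and the way pydnameth aligns bin edges are not covered, and bin_counts and the ages are invented for the example.

```python
import math
from collections import Counter


def bin_counts(values, bin_size):
    """Count values per bin; each bin is labelled by its left edge."""
    return Counter(math.floor(v / bin_size) * bin_size for v in values)


# Made-up ages for one subject subset, binned with bin_size = 1.0.
ages_f = [20.5, 21.2, 21.9, 35.0]
counts = bin_counts(ages_f, bin_size=1.0)
```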
Usage Examples¶
attributes_plot_observables_histogram¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
observables = pdm.Observables(
    name='observables',
    types={}
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
attributes = pdm.Attributes(
    target='age',
    observables=observables,
    cells=cells
)
observables_list = [
    {'gender': 'F'},
    {'gender': 'M'}
]
pdm.attributes_plot_observables_histogram(
    data=data,
    annotations=annotations,
    attributes=attributes,
    observables_list=observables_list,
    params={
        'bin_size': 1.0,
        'opacity': 0.75,
        'barmode': 'overlay'
    }
)
cpg_plot_methylation_scatter¶
import pydnameth as pdm
cpg_list = [
    'cg13982318',
    'cg11868595',
    'cg08900404'
]
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
observables = pdm.Observables(
    name='observables',
    types={}
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
attributes = pdm.Attributes(
    target='age',
    observables=observables,
    cells=cells
)
observables_list = [
    {'gender': 'F'},
    {'gender': 'M'}
]
pdm.cpg_plot_methylation_scatter(
    data=data,
    annotations=annotations,
    attributes=attributes,
    observables_list=observables_list,
    cpg_list=cpg_list,
    params={
        'x_range': [10, 110]
    }
)
cpg_proc_clock_linreg¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
obs_list = [
    {'gender': 'F'},
    {'gender': 'M'},
    {'gender': 'any'}
]
# Run the experiment once per subject subset.
for obs in obs_list:
    observables = pdm.Observables(
        name='observables',
        types=obs
    )
    attributes = pdm.Attributes(
        target='age',
        observables=observables,
        cells=cells
    )
    pdm.cpg_proc_clock_linreg(
        data=data,
        annotations=annotations,
        attributes=attributes,
        params={
            'type': 'all',
            'part': 0.25,
            'size': 100,
            'runs': 100,
        }
    )
cpg_proc_table_linreg¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
obs_list = [
    {'gender': 'F'},
    {'gender': 'M'},
    {'gender': 'any'}
]
# Run the experiment once per subject subset.
for obs in obs_list:
    observables = pdm.Observables(
        name='observables',
        types=obs
    )
    attributes = pdm.Attributes(
        target='age',
        observables=observables,
        cells=cells
    )
    pdm.cpg_proc_table_linreg(
        data=data,
        annotations=annotations,
        attributes=attributes
    )
cpg_proc_table_variance_linreg¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
obs_list = [
    {'gender': 'F'},
    {'gender': 'M'},
    {'gender': 'any'}
]
# Run the experiment once per subject subset.
for obs in obs_list:
    observables = pdm.Observables(
        name='observables',
        types=obs
    )
    attributes = pdm.Attributes(
        target='age',
        observables=observables,
        cells=cells
    )
    pdm.cpg_proc_table_variance_linreg(
        data=data,
        annotations=annotations,
        attributes=attributes
    )
cpg_proc_table_polygon¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='GSE87571'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
observables = pdm.Observables(
    name='observables',
    types={}
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
attributes = pdm.Attributes(
    target='age',
    observables=observables,
    cells=cells
)
observables_list = [
    {'gender': 'F'},
    {'gender': 'M'}
]
pdm.cpg_proc_table_polygon(
    data=data,
    annotations=annotations,
    attributes=attributes,
    observables_list=observables_list
)
cpg_proc_table_z_test_linreg¶
import pydnameth as pdm
data = pdm.Data(
    name='cpg_beta',
    path='C:/Data',
    base='EPIC'
)
annotations = pdm.Annotations(
    name='annotations',
    exclude='none',
    cross_reactive='ex',
    snp='ex',
    chr='NS',
    gene_region='yes',
    geo='any',
    probe_class='any'
)
observables = pdm.Observables(
    name='observables',
    types={}
)
cells = pdm.Cells(
    name='cells',
    types='any'
)
attributes = pdm.Attributes(
    target='age',
    observables=observables,
    cells=cells
)
observables_list = [
    {'gender': 'F'},
    {'gender': 'M'}
]
pdm.cpg_proc_table_z_test_linreg(
    data=data,
    annotations=annotations,
    attributes=attributes,
    observables_list=observables_list
)