This tutorial is meant for understanding the utility of DFilter and EFilter for different kinds of data from next gen. sequencing. The topics covered are as such

Using DFilter

a) Installing
b) Introduction : peak detection, making wiggle track
c) Variable peak width, assays and optimal parameters
d) Table for suggested parameter values for different assays
e) Combining multiple kinds of data
f) More detail option wise explanation

Using EFilter

Installing:

Linux complied version:

. unzip the compiled codes tar file (using command tar -xvf dfilter.tar.gz) and include the unzipped folder in PATH using command

export PATH=path_of_DFilter_folder:$PATH

Matlab version:

. unzip the downloaded matlab codes.

. Compile tag2mat.c and findlimits.c and put in any folder in the $PATH

.include the dfilter-folder in matlab path and start using

_________________________________________________________

Using DFilter

DFilter has been made to detect regulatory regions and enriched sites using tag count data generated by next gen. sequencing. It has been made using generalization approach so that data from multiple kinds of assays can be analyzed. Those assays can be ChIP-seq, Dnase-seq, Mnase-seq, FAIRE-seq, ChIP-exo,

input formats: bed/bam/sam/bedGraph

DFilter version1.5 can take four kinds of input files. The raw tags files can be in 6-column bed file, bedgraph, bam or sam format.

The 6 column bed files containing tags has following information

Column 1: chromosome
Column 2: tag-start
Column 3: tag-end
Column 4: user defined
Column 5: user defined
Column 6: sign + or -

For 6 column bed file there is no need to pass any hint as it is default

For bam file users should pass the type information as: -f=bam

such as

run_dfilter.sh -d=ChIP.bam -c=input-control.bam -o=peaks.bed -f=bam -ks=60

For paired-end bam or sam files additional option '-pe' should be provided otherwise it will treat the tags as single-end reads for peak calling. For paired end reads bam/sam file should be sorted according to tag-names for this purpose samtools can be used like this:

samtools sort ChIP.bam > sorted-ChIP.bam

then use Dfilter command as:

run_dfilter.sh -d=sorted-ChIP.bam -c=sorted-input-control.bam -o=peaks.bed -f=bam -pe -ks=60

For bedGraph file the command looks like :

run_dfilter.sh -d=ChIP.bedGraph -c=input-control.bedGraph -o=peaks.bed -f=bedGraph -ks=60

important: all the input tag or binned tag files should be in same format.

Note

An additional feature which makes DFilter unique, is that it can take binned tag-counts provided by user (in bedgraph format). Hence users can normalize the binned tag-count them-self before detecting enriched sites. Users could also divide binned tag count from one experiments by tags from other experiments in some special cases.

Output files

DFilter produces by default two output files. They are as such

First file: track-outfilename.bed :- this file is made for uploading to UCSC browser for visualisation purpose

Second file: outfilename.bed :- contains the detail information about called peaks with following info:

chromosome : chromosome on which peak was detected
peak-start : starting position of peak
peak-end : end position of peak
maxScore : maximum score of Hotelling filter with in peak region
peak-center : location of maximum score with in peak region
-log10(P-value) : Calculated P-value using maximum score
Norm-tag-count : average tag-count with in peak regions. Tag counts are normalized by input and sequencing depth
FDR(%) : Estimated False detection rate for maximum score of peak

making wiggle tracks

When “-wig” option is used DFilter makes wiggle track for sample tag data. It produces two kinds of

wiggle tracks which can be uploaded to UCSC browser directly or after compressing (using gzip). The first wiggle file is named after the provided chip file name such as “ChIP.bam.wig” it contains tag-count signal of raw tags in the chip file. The second wiggle is named after the output file name (such as outfilename.bed.wig ). It contains signal after normalization and applications of Hotelling filter on tag-count profile. This is the output signal on which peak detection is done finally. Users can uploaded this to visualize locations which are noisy with original data tracks.

Variable peak width, assays and optimal parameters

The most important parameter for DFilter would be to use 'zero mean' filter or 'non-zero mean' filter. For some assays the genuine sites can have wide enriched regions (such as H3K27me3, H3K9me3 H3K36me3). As well some assays, genuine tag profile has extremely narrow width such as transcription factor ChIP-seq. For such cases it is recommended to use non-zero mean kernel by using parameter '-nonzero'. The second most important parameter would be kernel width. Dfilter-version1.5 has been made sensitive enough to guess approximate kernel width. However users can provide their own filter width. For guidance users can use kernel width provided in table-1 for different assays. Remember for non-zero mean kernel keep the kernel-width below 3kb (ie -bs=100 , ks=5-30 ).

For directional regulatory regions (such as promoters) if users have training set, it is advised to use option '-dir' to increase sensitivity. However remember that if option '-dir' is used then the set of positives given to DFilter should consist of location of regulatory regions in only one direction. Such as for promoters, user should provide TSS locations of only positive strand genes.

Table-1: Suggested parameters for different cases for best results:

Notice: Dfilter output peak files can processed get the desired peaks with p-value cutoff using command on linux :

Assay- mark

Filter

type

Filter

width

DFilter command

Expected

cut-off

P-value

ChIP-seq

Histone acetylations

H3k27ac

H3BK20ac

H3K9ac

H3K14ac

and others

more focal histone methylations

H3K4me2

Zero mean

(default)

6kb

-ks=60

-bs=100

run_dfilter.sh -d=ChIP.bed -c=input-control.bed -o=peaks.bed -ks=60 -lpval=6

for input flles in bam format:::::::

run_dfilter.sh -d=ChIP.bam -c=input-control.bam -o=peaks.bed -ks=60 -std=2 -f=bam -ks=60 -lpval=6

1E-6

ChIP-seq

Wide focal

Histone methylations

H3K4me3

H3K4me1

Zero mean

(default)

8-10kb

-ks=100

-bs=100

run_dfilter.sh -d=ChIP.bed -c=input-control.bed -o=peaks.bed -ks=100 -lpval=6

1E-6

ChIP-seq

Non-focal histone modifications

H3K27me3

H3K9me3

H3K9me2

H3K36me3

CLIP-seq data with non-focal profile

Non-zero

mean

-nonzero

1.5-3kb

-kb=25

-bs=100

run_dfilter.sh -d=ChIP.bed -c=input-control.bed -o=peaks.bed -ks=20 -lpval=3 -nonzero

1E-3

ChIP-seq

Transcription

factors

RIP-seq RNA binding proteins, focal CLIP-seq

Non-zero

mean

-nonzero

0.5-2kb

-ks=20

-bs=50

run_dfilter.sh -d=ChIP.bed -c=input-control.bed -o=peaks.bed -ks=10 -lpval=6 -nonzero -refine

1E-6

Open-chromatin

assays

Dnase-seq

FAIRE-seq

Mnase-seq

Zero mean

(default)

4-5kb

-ks=50

-bs=100

run_dfilter.sh -d=Dnase-seq.bed -o=peaks.bed -ks=50 -lpval=2

1E-2

Combining multiple kinds of data

A unique feature in DFilter which makes it more generalized tool is to combine tag profiles from different assays to find regulatory elements. This feature of DFilter is useful when 2 or more libraries

are representative of same regulatory element, such as multiple histone acetylations at active enhancers or different assays to infer open-chromatin.

Some examples

for detecting active promoters using histone modifications

run_dfilter.sh -d=H3K4me3-tags.bed,H4K20me1-tags.bed -c=control-tags.bed -o=out.bed -lpval=6 -pos=promoters.bed -dir

or more than one ChIP-seq to support TF site prediction:

run_dfilter.sh -d=oct4-tags.bed, sox2-tags.bed -c=control-tags.bed -o=out.bed -nonzero -bs=50 -ks=30 -pm=300

for inferring open-chromatin from multiple assays

run_dfilter.sh -d=Dnase-seq.bed,FAIRE-seq.bed,ATAC-seq.bed -o=open-chromatin.bed -bs =100 -ks=50 -lpval=2

More detail option wise explanation:

Contact your system administrator to have this file installed.

-d=ChIPfiles filenames of sample, containing tags from high throughput sequencing. More than one filename can be given. The files can be in 6-column bed format,bam, sam or bedGraph format.

-c=INPUTfiles filenames of control input tags, more than one filename is given. same format as ChIPfile if more than one input control file is given then it should be equal to number of sample filename. (optional)

-o=OUPUTfile single filename to save predicted regions (compulsory)

-lpval=value a floating point number to decide threshold to detect region after calculating scores. Default is -lpval=5 ( threshold P-value =1E-5 ) (optional)

-f=bam/sam/bed/bedgraph the format of input either it is bam, sam,bedGraph or bed (6 column bed file) default input is assumed to be in bed format

-pos=bedFile the name of 3 column bed file containing set of positives (optional) .

( When some possible locations are known in advance they can be given as training set)

-bs=basepair bin size in form of base pair, default=100 basepairs (optional)

-ks=kernel_size size of kernel in form of bins. default=100 bin (optional)

-nonzero to use kernels with non-zero mean (optional), by default it uses zero mean kernel

for pattern recognition. non-zero mean kernel is good for transcription factors ChIP-seq or signals with variable width like H3K27me3

-wig to save wiggle track of raw signal and filtered signal, resolution same as bin-size

-pe if the tags are paired end. For paired-end read provide BAM/SAM file

(remember to sort tags for paried end mode according to name before passing as input)

-tagsize=basepair the size of tags in basepairs. Default is 200bp. for automatic finding of tagsize use "-tagsize=auto"

-refine to refine the center of detected region

-tw=basepairs truncate wider peaks which are more wide than given basepair width. (optional)

-redund=N N is the number of tags allowed with same orientation and location Default=1

-pm=basepairs merges the regions which are closer than given basepair distance. (optional)

-dir to run dfilter taking directionality of regultory site such as promoters (optional)

Using EFilter

EFilter is signal estimation tool which combines tag-count signals from multiple assays or ChIP-seq to predict gene expression. With current version (v1.5) it is possible to use any ChIP-seq, Dnase-seq data-set. The current version of EFilter finds the rank-invariant genes by predicting expression in first round and comparing with its own expression profiles. Hence now there is no need to give information about ChIP-seq like old version. The current version is also made such as users can opt for not doing gene-specific correction of expression (using option “-nocorr”). By default EFilter does correction for gene-specific biases by using pre-calculated residuals. When we do not perform gene-specific correction we expect the predicted expression to be more suitable for motif based analysis, such as inferring gene-networks.

-d=ChIPfiles filenames of sample, more than one filename can be given. The input file containing tags from high throughput sequencing. (compulsory)

-c=INPUTfiles filenames of control input tags, more than one filename can be given.'

if more than one input control file is given then it should be equal to number of sample filename. (optional)'

-f=bam /sam/ bed format of the input tag file it can be bam , bed or sam file. The default is 6 column bed format.

-g=geneFile the downloaded geneFile (compulsory)

-m=matrixFile the downloaded matrixFile (compulsory)'

-o=output output file (compulsory)'

-nocorr This option is used when no correction is done for gene specific biases

it will reduce correlation with actual expression but increase correlation with motif occurrences (optional)'