top of page

Manipulating files with bedtools


Bedtools is a suite of tools for working with genomic intervals, which are ranges along a DNA sequence. These intervals are often stored in GFF or bed files.


Bed files are tab-delimited text files that contain information about genomic intervals. They have a fixed set of fields, including the chromosome or scaffold the interval is on, the start and end positions of the interval, and other optional fields such as the name of the interval and its strand. Bed files are 0-based, meaning that the first base of a chromosome is numbered 0. The end position is 1-based, meaning that it is the first base that is not included in the interval.


The following fields can be found inside a bed file, of which only the first three fields are required.

Column

Field

Meaning

1

seqname

chr or scaffold id interval is on

2

start

start position (0-based) of interval

3

end

end position (1-based) of interval

4

name

name of interval

5

score

score, e.g. for peak calls or other intervals with scores; can be .

6

strand

strand interval is on for strand-specific intervals

7

thickStart

graphical parameter for genome browser display

8

thickEnd

graphical parameter for genome browser display

9

itemRGB

graphical parameter for genome browser display

10

blockCount

for multi-exon genes and similar

11

blockSizes

for multi-exon genes and similar

12

blockStarts

for multi-exon genes and similar

Finding closest interval

One common task in bioinformatics is finding the closest interval from one file to intervals in another file. This can be done using the "closest" option in bedtools. For example, you might have a list of putative enhancers in a bed file and want to find the closest gene to each one. You could use the following command:

bedtools closest -a enhancers.bed -b example.bed

This would compare each element in the "enhancers.bed" file to all elements in the "example.bed" file, and output the closest gene to each enhancer. You can also use the -d option to include the distance between the elements in the output.


Overlapping intervals

Another useful feature of bedtools is its ability to merge overlapping intervals. For example, you might have a bed file with multiple intervals that overlap each other. You can use the "merge" option to combine these intervals into a single interval. For example:

bedtools merge -i example.bed

This would merge all overlapping intervals in the "example.bed" file into a single interval.


Intersecting intervals

Sometimes you might have two bed files and want to find the intervals that overlap between them. For example, you might have a bed file of gene models and another bed file of peak calls from ChIP-seq data, and want to find the genes that are associated with the peaks. You can use the "intersect" option in bedtools to do this:

bedtools intersect -a gene_models.bed -b chipseq_peaks.bed

This would output the intervals in the "gene_models.bed" file that overlap with intervals in the "chipseq_peaks.bed" file.


To include the overlapping intervals from "chipseq_peaks.bed" in the output, you can use the "-wao" option. For example:

bedtools intersect -a gene_models.bed -b chipseq_peaks.bed -wao

This would output the intervals from "gene_models.bed" with overlaps in "chipseq_peaks.bed", along with the overlapping intervals from "chipseq_peaks.bed" and the number of base pairs that overlap.


If you want to split "gene_models.bed" into two separate files, one for intervals with overlaps in "chipseq_peaks.bed" and one for intervals with no overlaps, you can use the "-wa" and "-v" options. The "-wa" option writes the original intervals from "gene_models.bed" if they overlap any interval in "chipseq_peaks.bed", and the "-v" option only includes intervals from "gene_models.bed" with no overlap in "chipseq_peaks.bed". For example:

bedtools intersect -a gene_models.bed -b chipseq_peaks.bed -wa -v

This would output the intervals from "gene_models.bed" with no overlaps in "chipseq_peaks.bed".


To get the intervals with overlaps, you can use the "-wa" option without "-v":

bedtools intersect -a gene_models.bed -b chipseq_peaks.bed -wa

Bedtools intersect has many other options for fine-tuning the overlap criteria and for manipulating the output.


Finding interval length

You might also want to know the length of each interval in a bed file. You can use the "bedtools genomecov" command to do this, with the "-bg" option to output the data in bedGraph format:

bedtools genomecov -bg -i data/intervals.bed

This would output the start and end positions of each interval in the "intervals.bed" file, as well as the length of each interval.


Promoter intervals

Another useful feature of bedtools is the "flank" command, which creates a new interval adjacent to existing intervals in a bed file. This can be used to compute promoter intervals, for example, by defining promoters as the region 2kb upstream of the start site of each gene.


To compute promoter intervals with bedtools, you can use the following command:

bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s

This would take the intervals in "genes.bed" and create a new interval 2kb upstream of the start site of each interval. The "-g" option specifies a genome file with information about the size of each chromosome or scaffold, and the "-s" option specifies that the intervals in "genes.bed" are on the forward strand. You can also use the "-l" and "-r" options to specify the length of the interval to the left and right of the original intervals, respectively.


You can combine the "flank" command with other bedtools options using the Unix "|" operator, which allows the output of one tool to be used as the input for another. For example, you can use the "intersect" option to find the intervals in "enhancers.bed" that overlap with the promoter intervals computed with "flank":

bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s | bedtools intersect -a enhancers.bed -b -

This would output the intervals from "enhancers.bed" that overlap with the promoter intervals computed from "genes.bed". You can also use the "-u" option to output the intervals from "enhancers.bed" that do not overlap with the promoter intervals.


Sorting intervals

Bedtools also has an option for sorting intervals by their start position. This can be useful if you have a bed file that is not sorted and you want to process it in order. You can use the "sort" command to do this:

bedtools sort -i unsorted_intervals.bed > sorted_intervals.bed

This would sort the intervals in the "unsorted_intervals.bed" file by their start position and output the sorted intervals to a new file called "sorted_intervals.bed".


Bioinformatics one-liners

Here are a few examples of common tasks that can be performed with bedtools:

  • Making a bed file of intergenic regions: You can use the "complement" option to create a bed file of intervals that do not overlap with any intervals in a reference bed file. For example:

bedtools complement -i genes.bed -g genome.txt

This would create a bed file of intergenic regions based on the intervals in "genes.bed" and the genome information in "genome.txt".


  • Randomly shuffling the location of intervals: You can use the "shuffle" option to randomly shuffle the location of intervals in a bed file, while avoiding placing them in certain regions. For example:

bedtools shuffle -i enhancers.bed -g genome.txt -excl genes.bed

This would shuffle the location of the intervals in "enhancers.bed" while avoiding placing them in the intervals in "genes.bed".


  • Computing a metric to measure similarity between two sets of intervals: You can use the "jaccard" option to compute a single metric to measure the similarity between two sets of intervals. For example:

bedtools flank -i genes.bed -g genome.txt -l 2000 -r 0 -s | bedtools sort -i - | bedtools jaccard -a data/peaks.bed -b -

This would compute the Jaccard index


Comments


bottom of page