Extract and classify sequences based on barcode presence in BAM and FASTQ files
Key Features
Simple yet powerful: Extract barcodes from BAM or FASTQ files with minimal code
Support for both file types: Process BAM files (including softclipped regions) and FASTQ files (including paired-end data)
Flexible barcode options: Use single barcodes or specific 5’/3’ combinations
Orientation detection: Identify barcodes in both forward and reverse complement orientations
Fuzzy matching: Configure allowable mismatches for barcode detection
Specialized functions: Search in softclipped regions of BAM alignments or both reads in paired FASTQ data
Detailed statistics: Get comprehensive reports on barcode matches and distribution
Regular Expression Support in Barcode Matching
BarcodeSeqKit uses Python’s regular expression engine (re module) for exact barcode matching, which means you can leverage the full power of regular expressions in your barcode patterns.
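As a minimal sketch of what regular-expression barcode patterns make possible, the snippet below uses Python's standard `re` module directly. It does not call BarcodeSeqKit, and the pattern, the read, and the variable names are illustrative assumptions. A character class accepts either of two bases at one position, and a quantifier allows a spacer of variable length.

```python
import re

# Hypothetical barcode pattern (not taken from BarcodeSeqKit):
# position 7 may be C or T, then a spacer of 2 to 4 arbitrary bases,
# then a fixed anchor sequence.
pattern = re.compile(r"CTGACT[CT]CTTAAGGGCC[ACGT]{2,4}GATC")

# Made-up read used only for illustration.
read = "AAACTGACTTCTTAAGGGCCACGGATCTTT"

match = pattern.search(read)
if match:
    # Report where the pattern matched and the matched bases.
    print(match.start(), match.end(), match.group())
```

Whether a given pattern behaves the same way inside BarcodeSeqKit, in particular for reverse complement searches, should be checked against the tool itself.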
This command extracts reads containing the specified barcode (in either forward or reverse complement orientation) and creates:
- results/single_barcode_barcode_orientFR.bam: Forward orientation matches
- results/single_barcode_barcode_orientRC.bam: Reverse complement matches
- results/single_barcode_extraction_stats.json: Detailed statistics in JSON format
- results/single_barcode_extraction_stats.tsv: Detailed statistics in TSV format
This creates separate files for each barcode and orientation combination:
- results/dual_barcode_barcode5_orientFR.bam
- results/dual_barcode_barcode5_orientRC.bam
- results/dual_barcode_barcode3_orientFR.bam
- results/dual_barcode_barcode3_orientRC.bam
```python
from BarcodeSeqKit.fastq_processing import process_fastq_files

# Use the same config as above
fastq_files = ["tests/test.1.fastq.gz", "tests/test.2.fastq.gz"]
stats = process_fastq_files(
    config=config,
    fastq_files=fastq_files,
    compress_output=True,
    search_both_reads=True,
)
```
Key Concepts
Barcode Types
BarcodeSeqKit supports three types of barcode configurations:
Generic barcodes: Use these when you just want to find a specific sequence regardless of location
5’ barcodes: Use these when the barcode is expected at the 5’ end of the sequence
3’ barcodes: Use these when the barcode is expected at the 3’ end of the sequence
Barcode Orientations
For each barcode, BarcodeSeqKit tracks two possible orientations:
Forward (FR): The barcode appears in its specified sequence
Reverse Complement (RC): The barcode appears as its reverse complement
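The two orientations can be sketched in a few lines of plain Python. This is an illustration of the concept only, not BarcodeSeqKit's implementation: the function names are hypothetical, only unambiguous bases are handled, and reporting FR first when both orientations occur is an assumption of the sketch.

```python
from typing import Optional

# Complement table for unambiguous DNA bases (an assumption of this sketch).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def detect_orientation(read: str, barcode: str) -> Optional[str]:
    """Return 'FR', 'RC', or None depending on how the barcode occurs."""
    if barcode in read:
        return "FR"
    if reverse_complement(barcode) in read:
        return "RC"
    return None


barcode = "TAACTGAGGCCGGC"
print(reverse_complement(barcode))
print(detect_orientation("GGTAACTGAGGCCGGCAA", barcode))
print(detect_orientation("AAGCCGGCCTCAGTTATT", barcode))
```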
Softclipped Regions
When working with BAM files, the --search-softclipped option examines only the softclipped portions of reads:
For forward strand reads (+): Examines the 5’ softclipped region
For reverse strand reads (-): Examines the 3’ softclipped region
This is especially useful for spliced leader sequences in trypanosomatids, or wherever barcodes are softclipped during alignment.
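To make the idea concrete, here is a hedged sketch that pulls the softclipped bases out of a read using only its CIGAR string. It is not BarcodeSeqKit's code. The function name is hypothetical, and the mapping used here (clip at the start of the stored sequence for forward reads, clip at the end of the stored sequence for reverse reads) is an assumption about how the rule above translates to BAM coordinates.

```python
import re

# One CIGAR operation: a length followed by an operation code.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def softclipped_region(seq: str, cigar: str, is_reverse: bool) -> str:
    """Return the softclipped bases to search for a barcode.

    Assumption of this sketch: forward-strand reads use the clip at the
    start of the stored sequence, reverse-strand reads use the clip at
    the end of the stored sequence.
    """
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]
    # Hard-clipped bases (H) are absent from the stored sequence.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return ""
    if not is_reverse:
        n, op = ops[0]
        return seq[:n] if op == "S" else ""
    n, op = ops[-1]
    return seq[len(seq) - n:] if op == "S" else ""


# Made-up reads: a 17-base clip next to 10 aligned bases.
fwd = softclipped_region("CTGACTCCTTAAGGGCC" + "ACGTACGTAC", "17S10M", False)
rev = softclipped_region("ACGTACGTAC" + "CTGACTCCTTAAGGGCC", "10M17S", True)
print(fwd, rev)
```

With a library such as pysam, the same information is available from the parsed alignment record instead of a raw CIGAR string.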
Advanced Options
Command-Line Arguments
BarcodeSeqKit offers a range of options to customize your extraction:
| Option | Description |
| --- | --- |
| --max-mismatches N | Allow up to N mismatches in barcode detection |
| --search-softclipped | Search in softclipped regions (BAM only) |
| --search-both-reads | Look for barcodes in both reads of paired FASTQ files |
| --no-compress | Disable compression for FASTQ output files |
| --verbose | Enable detailed logging |
For a complete list, run barcodeseqkit --help.
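One way to picture what `--max-mismatches N` allows is a sliding-window comparison that counts substitutions. The sketch below is an illustration under that assumption (Hamming distance, no insertions or deletions); BarcodeSeqKit may implement fuzzy matching differently, and the function names are hypothetical.

```python
from typing import Optional, Tuple


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_with_mismatches(
    read: str, barcode: str, max_mismatches: int
) -> Optional[Tuple[int, int]]:
    """Return (position, mismatches) of the first window within the limit."""
    k = len(barcode)
    for start in range(len(read) - k + 1):
        distance = hamming_distance(read[start:start + k], barcode)
        if distance <= max_mismatches:
            return start, distance
    return None


barcode = "TAACTGAGGCCGGC"
# Made-up read carrying the barcode with a single substitution (C -> A).
read = "GGTAACTGAGGCAGGCTT"

print(find_with_mismatches(read, barcode, max_mismatches=0))
print(find_with_mismatches(read, barcode, max_mismatches=1))
```

Raising the limit increases sensitivity but also the chance of spurious hits, especially for short barcodes.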
Barcode Configuration Files
For complex projects with multiple barcodes, you can use a YAML configuration file:
```yaml
barcodes:
  - sequence: CTGACTCCTTAAGGGCC
    location: 5
    name: 5prime
    description: 5' barcode for experiment X
  - sequence: TAACTGAGGCCGGC
    location: 3
    name: 3prime
    description: 3' barcode for experiment X
```
When multiple barcodes are provided, BarcodeSeqKit parses the input file(s) only once and searches for all barcodes during that single pass.
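The single-pass idea can be sketched as follows: build every barcode and orientation pattern up front, then test each read against all of them while iterating once. This is a simplified illustration, not the tool's implementation. The function names are hypothetical, matching is exact, and assigning a read to the first matching category is an assumption; the category names mirror those that appear in the tool's log output.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def build_patterns(barcodes: Dict[str, str]) -> Dict[str, str]:
    """Map each category name to the sequence that defines it."""
    patterns = {}
    for name, seq in barcodes.items():
        patterns[f"{name}_orientFR"] = seq
        patterns[f"{name}_orientRC"] = reverse_complement(seq)
    return patterns


def classify_reads(reads: Iterable[str], barcodes: Dict[str, str]) -> Counter:
    """Classify all reads in one pass over the input."""
    patterns = build_patterns(barcodes)
    counts = Counter()
    for read in reads:  # the input is iterated exactly once
        category = next(
            (name for name, seq in patterns.items() if seq in read),
            "noBarcode",
        )
        counts[category] += 1
    return counts


barcodes = {"barcode5": "CTGACTCCTTAAGGGCC", "barcode3": "TAACTGAGGCCGGC"}
# Made-up reads for illustration.
reads = ["AACTGACTCCTTAAGGGCCTT", "GCCGGCCTCAGTTAAA", "ACGTACGT"]
counts = classify_reads(reads, barcodes)
print(dict(counts))
```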
Output Files and Statistics
BarcodeSeqKit generates:
Categorized output files: BAM or FASTQ files containing reads matching specific barcode/orientation combinations
Statistics in JSON format: Detailed machine-readable statistics
Statistics in TSV format: Human-readable tabular statistics
The statistics include:
- Total number of reads processed
- Total barcode matches found
- Match counts by barcode type
- Match counts by orientation
- Match counts by category
- Overall match rate
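As a rough illustration of how the listed figures relate to one another, the sketch below derives them from per-category read counts. The key names and the toy counts are hypothetical and do not reflect the actual layout of the JSON or TSV files that BarcodeSeqKit writes.

```python
import json
from collections import Counter
from typing import Dict


def summarize(category_counts: Dict[str, int]) -> dict:
    """Derive summary statistics from per-category read counts."""
    total_reads = sum(category_counts.values())
    by_barcode, by_orientation = Counter(), Counter()
    total_matches = 0
    for category, n in category_counts.items():
        if category == "noBarcode":
            continue
        # Category names look like "<barcode>_orient<FR|RC>".
        barcode, orientation = category.rsplit("_orient", 1)
        by_barcode[barcode] += n
        by_orientation[orientation] += n
        total_matches += n
    return {
        "total_reads": total_reads,
        "total_matches": total_matches,
        "by_barcode": dict(by_barcode),
        "by_orientation": dict(by_orientation),
        "by_category": dict(category_counts),
        "match_rate": total_matches / total_reads if total_reads else 0.0,
    }


# Toy counts, invented for illustration only.
toy_counts = {
    "barcode5_orientFR": 3,
    "barcode5_orientRC": 1,
    "barcode3_orientFR": 2,
    "barcode3_orientRC": 0,
    "noBarcode": 14,
}
summary = summarize(toy_counts)
print(json.dumps(summary, indent=2))
```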
Statistics-Only Mode
BarcodeSeqKit now supports a “statistics-only” mode that processes files and generates detailed statistics without writing output sequence files. This feature is particularly useful for:
Quickly analyzing barcode distributions in large datasets
Performing QC checks before committing to full processing
Estimating barcode frequencies without using additional disk space
Benchmarking and optimization tasks
Command-Line Usage
To use statistics-only mode from the command line, add the --only-stats flag:
Input BAM file: ../tests/test.bam
Using 5' barcode with sequence: CTGACTCCTTAAGGGCC
Using 3' barcode with sequence: TAACTGAGGCCGGC
Saved configuration to ../tests/index_cli/dual_barcode_config.yaml
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-17 12:18:29,756 - BarcodeSeqKit - INFO - Output categories: ['barcode5_orientFR', 'barcode5_orientRC', 'barcode3_orientFR', 'barcode3_orientRC', 'noBarcode']
Classifying reads: 100%|███████████████████| 498/498 [00:00<00:00, 58040.55it/s]
2025-03-17 12:18:29,810 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
Writing reads: 100%|██████████████████████| 498/498 [00:00<00:00, 230344.44it/s]
2025-03-17 12:18:29,816 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientFR.bam
2025-03-17 12:18:29,844 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode5_orientRC.bam
2025-03-17 12:18:29,856 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientFR.bam
2025-03-17 12:18:29,882 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_barcode3_orientRC.bam
2025-03-17 12:18:29,894 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/index_cli/dual_barcode_noBarcode.bam
Extraction complete
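A minimal sketch of driving statistics-only mode from Python follows. It only assembles an argument list from flags that appear elsewhere in this document (`--bam`, `--barcode5`, `--barcode3`, `--output-prefix`, `--output-dir`, `--only-stats`) and prints it without running anything. The helper name is hypothetical, and the exact invocation should be checked against `barcodeseqkit --help`.

```python
import shlex
from typing import List


def build_stats_only_command(
    bam: str, barcode5: str, barcode3: str, output_prefix: str, output_dir: str
) -> List[str]:
    """Assemble a barcodeseqkit argument list with --only-stats appended."""
    return [
        "barcodeseqkit",
        "--bam", bam,
        "--barcode5", barcode5,
        "--barcode3", barcode3,
        "--output-prefix", output_prefix,
        "--output-dir", output_dir,
        "--only-stats",
    ]


cmd = build_stats_only_command(
    bam="tests/test.bam",
    barcode5="CTGACTCCTTAAGGGCC",
    barcode3="TAACTGAGGCCGGC",
    output_prefix="dual_barcode",
    output_dir="results",
)
# Print the shell-quoted command; to execute it, pass `cmd` to
# subprocess.run(cmd, check=True) once the tool is installed.
print(shlex.join(cmd))
```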
Using BarcodeSeqKit with Containers
BarcodeSeqKit is available as a containerized application, which allows you to run it without installing any dependencies on your system. This guide explains how to use BarcodeSeqKit with Docker and Singularity containers.
The container includes a test BAM file at /app/tests/test.bam. You can run BarcodeSeqKit on this test file and save the results to your local machine:
```bash
# Create a directory for the results
mkdir -p results

# Run the container with the included test file
docker run --rm \
  -v $(pwd)/results:/output \
  mtinti/barcodeseqkit:0.0.4 \
  --bam /app/tests/test.bam \
  --barcode5 CTGACTCCTTAAGGGCC \
  --barcode3 TAACTGAGGCCGGC \
  --output-prefix test_extraction \
  --output-dir /output \
  --search-softclipped \
  --verbose
```
This command:
- Uses the test BAM file already included in the container
- Mounts your local results directory to /output inside the container
- Extracts reads matching the specified 5’ and 3’ barcodes
- Saves the results in your local results directory
BarcodeSeqKit provides a streamlined, user-friendly approach to barcode extraction from sequencing data. With its intuitive interface and flexible options, it’s suitable for a wide range of applications, from simple barcode detection to complex multi-barcode analyses.
Whether you’re working with BAM files, FASTQ files, single barcodes, or multiple barcodes with specific locations, BarcodeSeqKit offers a straightforward solution for your barcode extraction needs.