Introduction
This notebook implements the BAM processing functionality for BarcodeSeqKit. It includes functions for handling BAM files, extracting barcodes, and sorting reads into appropriate output files based on barcode detection.
BAM File Utilities
First, let’s define utility functions for BAM file operations.
source
BamUtils
BamUtils ()
Utility functions for BAM file operations.
BAM Processor
source
process_bam_file
process_bam_file (config:BarcodeSeqKit.core.BarcodeExtractorConfig,
bam_file:str)
*Process a BAM file to extract barcodes.
Args: config: Barcode extractor configuration bam_file: Path to the input BAM file
Returns: Statistics from the extraction process*
Helper Functions
source
prepare_bam_categories
prepare_bam_categories (barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
single_barcode_mode:bool)
*Prepare output categories based on barcodes.
Args: barcodes: List of barcode configurations single_barcode_mode: Whether we’re in single barcode mode
Returns: List of category names*
source
classify_read
classify_read (sequence:str,
barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
max_mismatches:int, single_barcode_mode:bool)
*Classify a read sequence based on barcode matches.
Args: sequence: Read sequence to classify barcodes: List of barcode configurations max_mismatches: Maximum number of mismatches allowed single_barcode_mode: Whether we’re in single barcode mode
Returns: Tuple of (best_match, category)*
source
get_output_path
get_output_path (output_prefix:str, output_dir:str, category:str)
*Get the output path for a category.
Args: output_prefix: Prefix for output files output_dir: Directory for output files category: Category name
Returns: Path to the output BAM file*
source
save_statistics
save_statistics (stats:BarcodeSeqKit.core.ExtractionStatistics,
output_prefix:str, output_dir:str)
*Save extraction statistics to files.
Args: stats: Extraction statistics output_prefix: Prefix for output files output_dir: Directory for output files*
Example Usage
Let’s demonstrate how to use these functions.
# Example usage with test.bam
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig
# Define barcodes to search for
barcodes = [
BarcodeConfig(
sequence= "TAACTGAGGCCGGC" , # 3' barcode
location= BarcodeLocationType.THREE_PRIME,
name= "3prime" ,
description= "3' barcode from test data"
),
BarcodeConfig(
sequence= "CTGACTCCTTAAGGGCC" , # 5' barcode
location= BarcodeLocationType.FIVE_PRIME,
name= "5prime" ,
description= "5' barcode from test data"
)
]
# Create a configuration
output_dir = "../tests/bam_output"
os.makedirs(output_dir, exist_ok= True )
config = BarcodeExtractorConfig(
barcodes= barcodes,
output_prefix= "test_extraction" ,
output_dir= output_dir,
max_mismatches= 0 ,
search_softclipped= True , # Search in softclipped regions for barcodes
verbose= True
)
# Path to test BAM file
test_bam = "../tests/test.bam"
if os.path.exists(test_bam):
print (f"Processing { test_bam} " )
# Process the BAM file
stats = process_bam_file(config, test_bam)
# Print results
print (f" \n Total reads: { stats. total_reads} " )
print (f"Total barcode matches: { stats. total_barcode_matches} " )
for barcode_name, count in stats.matches_by_barcode.items():
print (f" { barcode_name} : { count} matches" )
for orientation, count in stats.matches_by_orientation.items():
print (f" Orientation { orientation} : { count} matches" )
for category, count in stats.matches_by_category.items():
print (f" Category { category} : { count} matches" )
# List the output files
output_files = [f for f in os.listdir(output_dir) if f.startswith("test_extraction_" ) and f.endswith(".bam" )]
print (" \n Output files:" )
for f in output_files:
path = os.path.join(output_dir, f)
size = os.path.getsize(path)
print (f" { f} ( { size} bytes)" )
try :
# Use BamUtils to get read count for each output file
read_count = BamUtils.get_read_count(path)
print (f" { f} ( { read_count} alignments)" )
except Exception as e:
# Fallback to file size if read count fails
size = os.path.getsize(path)
print (f" { f} ( { size} bytes) - Error getting alignment count: { str (e)} " )
2025-03-24 13:55:29,817 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 13:55:29,818 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']
Processing ../tests/test.bam
2025-03-24 13:55:29,851 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
2025-03-24 13:55:29,878 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode3_orientFR.bam
2025-03-24 13:55:29,894 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode3_orientRC.bam
2025-03-24 13:55:29,902 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode5_orientFR.bam
2025-03-24 13:55:29,912 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode5_orientRC.bam
2025-03-24 13:55:29,920 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_noBarcode.bam
Total reads: 498
Total barcode matches: 18
5prime: 10 matches
3prime: 8 matches
Orientation FR: 10 matches
Orientation RC: 8 matches
Category barcode5_orientFR: 7 matches
Category barcode3_orientFR: 3 matches
Category barcode3_orientRC: 5 matches
Category barcode5_orientRC: 3 matches
Output files:
test_extraction_barcode5_orientRC.bam (5998 bytes)
test_extraction_barcode5_orientRC.bam (6 alignments)
test_extraction_barcode5_orientFR.bam (6392 bytes)
test_extraction_barcode5_orientFR.bam (14 alignments)
test_extraction_barcode3_orientRC.bam (6069 bytes)
test_extraction_barcode3_orientRC.bam (10 alignments)
test_extraction_barcode3_orientFR.bam (5928 bytes)
test_extraction_barcode3_orientFR.bam (6 alignments)
test_extraction_noBarcode.bam (26763 bytes)
test_extraction_noBarcode.bam (462 alignments)
Conclusion
This notebook implements the BAM processing functionality for BarcodeSeqKit. It provides efficient extraction and classification of barcoded reads from BAM files without using abstract classes, making the code more direct and easier to understand.