BAM Processing

Processing BAM files for barcode extraction and classification

Introduction

This notebook implements the BAM processing functionality for BarcodeSeqKit. It includes functions for handling BAM files, extracting barcodes, and sorting reads into appropriate output files based on barcode detection.

BAM File Utilities

First, let’s define utility functions for BAM file operations.


source

BamUtils

 BamUtils ()

Utility functions for BAM file operations.

Softclip Extraction Function


source

extract_softclipped_region

 extract_softclipped_region (read)

*Extracts softclipped regions from an alignment. For + strand: gets softclipped region at 5’ end of read For - strand: gets softclipped region at 3’ end of read

Args: read: A pysam.AlignedSegment object

Returns: str: Softclipped sequence or empty string if none exists*

BAM Processor


source

process_bam_file

 process_bam_file (config:BarcodeSeqKit.core.BarcodeExtractorConfig,
                   bam_file:str)

*Process a BAM file to extract barcodes.

Args: config: Barcode extractor configuration bam_file: Path to the input BAM file

Returns: Statistics from the extraction process*

Helper Functions


source

prepare_bam_categories

 prepare_bam_categories (barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
                         single_barcode_mode:bool)

*Prepare output categories based on barcodes.

Args: barcodes: List of barcode configurations single_barcode_mode: Whether we’re in single barcode mode

Returns: List of category names*


source

classify_read

 classify_read (sequence:str,
                barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
                max_mismatches:int, single_barcode_mode:bool)

*Classify a read sequence based on barcode matches.

Args: sequence: Read sequence to classify barcodes: List of barcode configurations max_mismatches: Maximum number of mismatches allowed single_barcode_mode: Whether we’re in single barcode mode

Returns: Tuple of (best_match, category)*


source

get_output_path

 get_output_path (output_prefix:str, output_dir:str, category:str)

*Get the output path for a category.

Args: output_prefix: Prefix for output files output_dir: Directory for output files category: Category name

Returns: Path to the output BAM file*


source

save_statistics

 save_statistics (stats:BarcodeSeqKit.core.ExtractionStatistics,
                  output_prefix:str, output_dir:str)

*Save extraction statistics to files.

Args: stats: Extraction statistics output_prefix: Prefix for output files output_dir: Directory for output files*

Example Usage

Let’s demonstrate how to use these functions.

# Example usage with test.bam
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType, BarcodeExtractorConfig

# Define barcodes to search for
barcodes = [
    BarcodeConfig(
        sequence="TAACTGAGGCCGGC",  # 3' barcode 
        location=BarcodeLocationType.THREE_PRIME,
        name="3prime",
        description="3' barcode from test data"
    ),
    BarcodeConfig(
        sequence="CTGACTCCTTAAGGGCC",  # 5' barcode
        location=BarcodeLocationType.FIVE_PRIME,
        name="5prime",
        description="5' barcode from test data"
    )
]

# Create a configuration
output_dir = "../tests/bam_output"
os.makedirs(output_dir, exist_ok=True)

config = BarcodeExtractorConfig(
    barcodes=barcodes,
    output_prefix="test_extraction",
    output_dir=output_dir,
    max_mismatches=0,
    search_softclipped=True,  # Search in softclipped regions for barcodes
    verbose=True
)

# Path to test BAM file
test_bam = "../tests/test.bam"

if os.path.exists(test_bam):
    print(f"Processing {test_bam}")
    
    # Process the BAM file
    stats = process_bam_file(config, test_bam)
    
    # Print results
    print(f"\nTotal reads: {stats.total_reads}")
    print(f"Total barcode matches: {stats.total_barcode_matches}")
    
    for barcode_name, count in stats.matches_by_barcode.items():
        print(f"  {barcode_name}: {count} matches")
    
    for orientation, count in stats.matches_by_orientation.items():
        print(f"  Orientation {orientation}: {count} matches")
    
    for category, count in stats.matches_by_category.items():
        print(f"  Category {category}: {count} matches")
    
    # List the output files
    output_files = [f for f in os.listdir(output_dir) if f.startswith("test_extraction_") and f.endswith(".bam")]
    print("\nOutput files:")
    for f in output_files:
        path = os.path.join(output_dir, f)
        size = os.path.getsize(path)
        print(f"  {f} ({size} bytes)")
        try:
            # Use BamUtils to get read count for each output file
            read_count = BamUtils.get_read_count(path)
            print(f"  {f} ({read_count} alignments)")
        except Exception as e:
            # Fallback to file size if read count fails
            size = os.path.getsize(path)
            print(f"  {f} ({size} bytes) - Error getting alignment count: {str(e)}")
2025-03-24 13:55:29,817 - BarcodeSeqKit - INFO - BAM file: ../tests/test.bam (498 reads)
2025-03-24 13:55:29,818 - BarcodeSeqKit - INFO - Output categories: ['barcode3_orientFR', 'barcode3_orientRC', 'barcode5_orientFR', 'barcode5_orientRC', 'noBarcode']
Processing ../tests/test.bam
2025-03-24 13:55:29,851 - BarcodeSeqKit - INFO - First pass complete: classified 18 reads
2025-03-24 13:55:29,878 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode3_orientFR.bam
2025-03-24 13:55:29,894 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode3_orientRC.bam
2025-03-24 13:55:29,902 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode5_orientFR.bam
2025-03-24 13:55:29,912 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_barcode5_orientRC.bam
2025-03-24 13:55:29,920 - BarcodeSeqKit - INFO - Sorting and indexing ../tests/bam_output/test_extraction_noBarcode.bam

Total reads: 498
Total barcode matches: 18
  5prime: 10 matches
  3prime: 8 matches
  Orientation FR: 10 matches
  Orientation RC: 8 matches
  Category barcode5_orientFR: 7 matches
  Category barcode3_orientFR: 3 matches
  Category barcode3_orientRC: 5 matches
  Category barcode5_orientRC: 3 matches

Output files:
  test_extraction_barcode5_orientRC.bam (5998 bytes)
  test_extraction_barcode5_orientRC.bam (6 alignments)
  test_extraction_barcode5_orientFR.bam (6392 bytes)
  test_extraction_barcode5_orientFR.bam (14 alignments)
  test_extraction_barcode3_orientRC.bam (6069 bytes)
  test_extraction_barcode3_orientRC.bam (10 alignments)
  test_extraction_barcode3_orientFR.bam (5928 bytes)
  test_extraction_barcode3_orientFR.bam (6 alignments)
  test_extraction_noBarcode.bam (26763 bytes)
  test_extraction_noBarcode.bam (462 alignments)

Conclusion

This notebook implements the BAM processing functionality for BarcodeSeqKit. It provides efficient extraction and classification of barcoded reads from BAM files without using abstract classes, making the code more direct and easier to understand.