Utilities for working with DNA/RNA sequences and barcode detection
Introduction
This notebook contains utility functions for sequence manipulation, barcode detection, and quality assessment in the BarcodeSeqKit library. These utilities are format-agnostic and can be used with both BAM and FASTQ data.
Args:
    sequence: The DNA/RNA sequence to search in
    barcodes: List of barcode configurations to search for
    max_mismatches: Maximum number of mismatches to allow
Returns:
    List of BarcodeMatch objects representing the matches found
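One way to picture a mismatch-tolerant search like this is a sliding-window comparison against the barcode and its reverse complement. The block below is a minimal sketch under that assumption, not BarcodeSeqKit's actual implementation. The names `reverse_complement`, `hamming_distance`, and `find_matches_sketch` are hypothetical, and plain tuples stand in for `BarcodeMatch` objects.

```python
from typing import List, Tuple

# Complement table for upper-case DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_matches_sketch(
    sequence: str, barcode: str, max_mismatches: int = 0
) -> List[Tuple[str, int, str]]:
    """Hypothetical helper: scan every window of the read in both orientations.

    Returns (orientation, position, matched_text) tuples, where "FR" means the
    barcode as given and "RC" means its reverse complement. Positions are
    0-based.
    """
    hits = []
    candidates = [("FR", barcode), ("RC", reverse_complement(barcode))]
    for orientation, query in candidates:
        for pos in range(len(sequence) - len(query) + 1):
            window = sequence[pos:pos + len(query)]
            if hamming_distance(window, query) <= max_mismatches:
                hits.append((orientation, pos, window))
    return hits


exact = find_matches_sketch("AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA", "TCGCGAGGC")
fuzzy = find_matches_sketch("AAAAAAGGCCGGCCTGAAAAAA", "GGCCGGCCGG", max_mismatches=1)
print(exact)
print(fuzzy)
```

This sketch only handles substitutions (Hamming distance). Whether the library also tolerates insertions or deletions is not stated here.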
Args:
    match: Barcode match or None
    single_barcode_mode: Whether we’re in single barcode mode
Returns:
    Category string for output file naming
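The mapping from a match to a category string can be sketched as follows. Only the `barcode<name>_orient<orientation>` pattern is taken from the example output in this notebook (for instance `barcode5_orientFR`). The label for unmatched reads and the single-barcode naming are assumptions made for illustration, and `category_sketch` is a hypothetical name, not the library's function.

```python
from typing import Optional, Tuple


def category_sketch(
    match: Optional[Tuple[str, str]], single_barcode_mode: bool = False
) -> str:
    """Hypothetical helper: build an output category from a match.

    `match` is a (barcode_name, orientation) tuple standing in for a
    BarcodeMatch object, or None when no barcode was found.
    """
    if match is None:
        # ASSUMPTION: the real label for unmatched reads is not shown here.
        return "noBarcode"
    name, orientation = match
    if single_barcode_mode:
        # ASSUMPTION: with one barcode, the name is dropped from the category.
        return f"barcode_orient{orientation}"
    # Pattern seen in the example output, e.g. "barcode5_orientFR".
    return f"barcode{name}_orient{orientation}"


print(category_sketch(("5", "FR")))
print(category_sketch(("3", "RC")))
print(category_sketch(None))
```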
Example Usage
Let’s demonstrate how to use these utilities.
# Create test barcode configurations
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType

barcode_5prime = BarcodeConfig(
    sequence="TCGCGAGGC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5",
    description="5' barcode for test"
)
barcode_3prime = BarcodeConfig(
    sequence="GGCCGGCCGG",
    location=BarcodeLocationType.THREE_PRIME,
    name="3",
    description="3' barcode for test"
)

# Example sequence with barcodes
sequence = "AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA"
print(f"Test sequence: {sequence}")

# Find all barcode matches
matches = find_barcode_matches(sequence, [barcode_5prime, barcode_3prime])
print(f"Found {len(matches)} matches:")
for match in matches:
    print(f" {match}")
    print(f" Barcode: {match.barcode.name}")
    print(f" Orientation: {match.orientation.value}")
    print(f" Position: {match.position}")
    print(f" Sequence: {match.sequence}")

# Classify a read using first match
print("\nClassifying reads:")
for test_seq in [
    sequence,
    "AAAAAAGCCTCGCGAAAAAAA",   # reverse complement of the 5' barcode
    "AAAAAAGGCCGGCCTGAAAAAA",  # 3' barcode with one mismatch
]:
    match, category = classify_read_by_first_match(
        sequence=test_seq,
        barcodes=[barcode_5prime, barcode_3prime],
        max_mismatches=1
    )
    print(f"Sequence: {test_seq}")
    print(f" Match: {match}")
    print(f" Category: {category}")
Test sequence: AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA
Found 2 matches:
5 (FR) at position 6
Barcode: 5
Orientation: FR
Position: 6
Sequence: TCGCGAGGC
3 (FR) at position 22
Barcode: 3
Orientation: FR
Position: 22
Sequence: GGCCGGCCGG
Classifying reads:
Sequence: AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA
Match: 5 (FR) at position 5
Category: barcode5_orientFR
Sequence: AAAAAAGCCTCGCGAAAAAAA
Match: 5 (RC) at position 5
Category: barcode5_orientRC
Sequence: AAAAAAGGCCGGCCTGAAAAAA
Match: 3 (FR) at position 6
Category: barcode3_orientFR
# Test with real data
import os
import pysam
from tqdm.auto import tqdm
from BarcodeSeqKit.core import ExtractionStatistics

# Path to the test file (adjust if needed)
bam_file = "../tests/test.bam"

if os.path.exists(bam_file):
    print(f"Testing with {bam_file}")
    stats = ExtractionStatistics()

    # Define barcodes to search for
    example_barcodes = [
        BarcodeConfig(
            sequence="TAACTGAGGCCGGC",  # Example barcode to search for
            location=BarcodeLocationType.THREE_PRIME,
            name="3prime",
            description="3' barcode from test data"
        ),
        BarcodeConfig(
            sequence="CTGACTCCTTAAGGGCC",  # Example barcode to search for
            location=BarcodeLocationType.FIVE_PRIME,
            name="5prime",
            description="5' barcode from test data"
        )
    ]

    # Count matches
    with pysam.AlignmentFile(bam_file, "rb") as bam:
        for read in tqdm(bam):
            stats.total_reads += 1
            sequence = read.query_sequence
            if sequence:
                match, category = classify_read_by_first_match(
                    sequence=sequence,
                    barcodes=example_barcodes,
                    max_mismatches=0
                )
                if match:
                    stats.update_barcode_match(match, category)

    # Print statistics
    print("\nBarcode detection statistics:")
    print(f"Total reads: {stats.total_reads}")
    print(f"Total matches: {stats.total_barcode_matches}")
    for barcode_name, count in stats.matches_by_barcode.items():
        print(f" {barcode_name}: {count} matches")
    for orientation, count in stats.matches_by_orientation.items():
        print(f" Orientation {orientation}: {count} matches")
    for category, count in stats.matches_by_category.items():
        print(f" Category {category}: {count} matches")
else:
    print(f"Test file not found: {bam_file}")
Test file not found: ../tests/test.bam
Conclusion
This notebook provides utility functions for sequence manipulation, barcode detection, and classification in BarcodeSeqKit. These functions are used by both the BAM and FASTQ processing modules to identify and categorize barcoded reads.