Sequence Utilities

Utilities for working with DNA/RNA sequences and barcode detection

Introduction

This notebook contains utility functions for sequence manipulation, barcode detection, and quality assessment in the BarcodeSeqKit library. These utilities are format-agnostic and can be used with both BAM and FASTQ data.

Basic Sequence Operations

First, let’s define basic sequence manipulation functions.


source

reverse_complement

 reverse_complement (sequence:str)

*Return the reverse complement of a DNA sequence.

Args: sequence: DNA sequence

Returns: Reverse complement of the sequence*


source

hamming_distance

 hamming_distance (seq1:str, seq2:str)

*Calculate the Hamming distance between two sequences.

The sequences must be of the same length.

Args: seq1: First sequence seq2: Second sequence

Returns: Hamming distance (number of positions where the sequences differ)

Raises: ValueError: If sequences have different lengths*

Barcode Detection

Now let’s implement functions for detecting barcodes in sequences.


source

find_barcode_matches

 find_barcode_matches (sequence:str,
                       barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
                       max_mismatches:int=0)

*Find all barcode matches in a sequence.

Args: sequence: The DNA/RNA sequence to search in barcodes: List of barcode configurations to search for max_mismatches: Maximum number of mismatches to allow

Returns: List of BarcodeMatch objects representing the matches found*


source

find_best_barcode_match

 find_best_barcode_match (sequence:str,
                          barcodes:List[BarcodeSeqKit.core.BarcodeConfig],
                          max_mismatches:int=1)

*Find the best matching barcode in a sequence.

Args: sequence: The sequence to search in barcodes: List of barcode configurations to search for max_mismatches: Maximum number of mismatches to allow

Returns: The best matching BarcodeMatch or None if no match found*


source

classify_read_by_first_match

 classify_read_by_first_match (sequence:str,
                               barcodes:List[BarcodeSeqKit.core.BarcodeCon
                               fig], max_mismatches:int=0)

*Classify a read based on the first barcode match found.

This is an optimized version that stops after finding the first match, without evaluating all possible matches in the sequence.

Args: sequence: The sequence to classify barcodes: List of barcode configurations to search for max_mismatches: Maximum number of mismatches to allow

Returns: Tuple of (match, category) Category is one of: “barcode5_orientFR”, “barcode5_orientRC”, etc. or “noBarcode” if no match found*

Category and Output File Handling


source

get_output_category

 get_output_category (match:Optional[BarcodeSeqKit.core.BarcodeMatch],
                      single_barcode_mode:bool=False)

*Get the output category for a barcode match.

Args: match: Barcode match or None single_barcode_mode: Whether we’re in single barcode mode

Returns: Category string for output file naming*

Example Usage

Let’s demonstrate how to use these utilities.

# Create test barcode configurations
from BarcodeSeqKit.core import BarcodeConfig, BarcodeLocationType

barcode_5prime = BarcodeConfig(
    sequence="TCGCGAGGC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5",
    description="5' barcode for test"
)

barcode_3prime = BarcodeConfig(
    sequence="GGCCGGCCGG",
    location=BarcodeLocationType.THREE_PRIME,
    name="3",
    description="3' barcode for test"
)

# Example sequence with barcodes
sequence = "AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA"
print(f"Test sequence: {sequence}")

# Find all barcode matches
matches = find_barcode_matches(sequence, [barcode_5prime, barcode_3prime])
print(f"Found {len(matches)} matches:")
for match in matches:
    print(f"  {match}")
    print(f"    Barcode: {match.barcode.name}")
    print(f"    Orientation: {match.orientation.value}")
    print(f"    Position: {match.position}")
    print(f"    Sequence: {match.sequence}")

# Classify a read using first match
print("\nClassifying reads:")
for test_seq in [sequence, 
                "AAAAAAGCCTCGCGAAAAAAA",  # 5' barcode with mismatch
                "AAAAAAGGCCGGCCTGAAAAAA"]: # 3' barcode with mismatch
    match, category = classify_read_by_first_match(
        sequence=test_seq,
        barcodes=[barcode_5prime, barcode_3prime],
        max_mismatches=1
    )
    print(f"Sequence: {test_seq}")
    print(f"  Match: {match}")
    print(f"  Category: {category}")
Test sequence: AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA
Found 2 matches:
  5 (FR) at position 6
    Barcode: 5
    Orientation: FR
    Position: 6
    Sequence: TCGCGAGGC
  3 (FR) at position 22
    Barcode: 3
    Orientation: FR
    Position: 22
    Sequence: GGCCGGCCGG

Classifying reads:
Sequence: AAAAAATCGCGAGGCAAAAAAAGGCCGGCCGGAAAAAA
  Match: 5 (FR) at position 5
  Category: barcode5_orientFR
Sequence: AAAAAAGCCTCGCGAAAAAAA
  Match: 5 (RC) at position 5
  Category: barcode5_orientRC
Sequence: AAAAAAGGCCGGCCTGAAAAAA
  Match: 3 (FR) at position 6
  Category: barcode3_orientFR
# Test with real data 
import os
import pysam
from tqdm.auto import tqdm
from BarcodeSeqKit.core import ExtractionStatistics

# Path to the test file (adjust if needed)
bam_file = "../tests/test.bam"

if os.path.exists(bam_file):
    print(f"Testing with {bam_file}")
    stats = ExtractionStatistics()
    
    # Define barcodes to search for
    example_barcodes = [
        BarcodeConfig(
            sequence="TAACTGAGGCCGGC",  # Example barcode to search for
            location=BarcodeLocationType.THREE_PRIME,
            name="3prime",
            description="3' barcode from test data"
        ),
        BarcodeConfig(
            sequence="CTGACTCCTTAAGGGCC",  # Example barcode to search for
            location=BarcodeLocationType.FIVE_PRIME,
            name="5prime",
            description="5' barcode from test data"
        )
    ]
    
    # Count matches
    with pysam.AlignmentFile(bam_file, "rb") as bam:
        for read in tqdm(bam):
            stats.total_reads += 1
            sequence = read.query_sequence
            if sequence:
                match, category = classify_read_by_first_match(
                    sequence=sequence,
                    barcodes=example_barcodes,
                    max_mismatches=0
                )
                if match:
                    stats.update_barcode_match(match, category)
    
    # Print statistics
    print("\nBarcode detection statistics:")
    print(f"Total reads: {stats.total_reads}")
    print(f"Total matches: {stats.total_barcode_matches}")
    
    for barcode_name, count in stats.matches_by_barcode.items():
        print(f"  {barcode_name}: {count} matches")
    
    for orientation, count in stats.matches_by_orientation.items():
        print(f"  Orientation {orientation}: {count} matches")
    
    for category, count in stats.matches_by_category.items():
        print(f"  Category {category}: {count} matches")
else:
    print(f"Test file not found: {bam_file}")
Test file not found: ../tests/test.bam

Conclusion

This notebook provides utility functions for sequence manipulation, barcode detection, and classification in BarcodeSeqKit. These functions are used by both the BAM and FASTQ processing modules to identify and categorize barcoded reads.