BarcodeSeqKit Core

Core data structures for barcode extraction from sequencing data

Introduction

BarcodeSeqKit is a library designed for extracting and processing barcoded sequences from next-generation sequencing data. This notebook contains the core data structures and utility functions for the library.

Enumerations

Let’s define the basic enumerations used throughout the library.



OrientationType

 OrientationType (value, names=None, module=None, qualname=None,
                  type=None, start=1)

Orientation types for barcode sequences.



BarcodeLocationType

 BarcodeLocationType (value, names=None, module=None, qualname=None,
                      type=None, start=1)

Location types for barcodes.
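The two signatures above appear to be the generic `Enum` constructor as rendered by the documentation tool, so they say little about how the enumerations are used. The sketch below shows member access by name and lookup by value. The class name `LocationSketch` and the values of `FIVE_PRIME` and `THREE_PRIME` are assumptions made for illustration; only the member names and `UNKNOWN = 'UNK'` appear in this document.

```python
from enum import Enum


class LocationSketch(Enum):
    """Stand-in for BarcodeLocationType; not the library's definition."""

    FIVE_PRIME = "5"    # hypothetical value
    THREE_PRIME = "3"   # hypothetical value
    UNKNOWN = "UNK"     # value shown in the BarcodeConfig signature


# Access a member by name, or recover it from its stored value.
by_name = LocationSketch.FIVE_PRIME
by_value = LocationSketch("UNK")

print(by_name.name, by_value.name)
```

Looking a member up by value is what makes it possible to store a short string such as `'UNK'` in a configuration file and turn it back into an enum member when loading.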



BarcodeExtractorConfig

 BarcodeExtractorConfig (barcodes:List[__main__.BarcodeConfig],
                         output_prefix:str, output_dir:str='.',
                         max_mismatches:int=0,
                         search_softclipped:bool=False,
                         verbose:bool=False, log_file:Optional[str]=None,
                         write_output_files:bool=True)

Configuration for barcode extraction.



BarcodeConfig

 BarcodeConfig (sequence:str,
                location:__main__.BarcodeLocationType=<BarcodeLocationType
                .UNKNOWN: 'UNK'>, name:Optional[str]=None,
                description:Optional[str]=None)

Configuration for a barcode sequence.



BarcodeMatch

 BarcodeMatch (barcode:__main__.BarcodeConfig,
               orientation:__main__.OrientationType, position:int,
               sequence:str)

Represents a match of a barcode in a sequence.
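To illustrate where the `orientation`, `position`, and `sequence` fields could come from, here is a minimal sketch of a sliding-window search that tolerates up to `max_mismatches` substitutions. It is not the library's matching code: the helper names `hamming` and `find_hits`, the orientation labels, and the tuple layout are all assumptions. The forward and reverse-complement sequences are taken from the example output in this document.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_hits(
    read: str, queries: Dict[str, str], max_mismatches: int = 0
) -> List[Tuple[str, int, str]]:
    """Return (orientation label, 0-based position, matched sequence) tuples.

    Hypothetical helper: every window of the read is compared against each
    query and kept when the Hamming distance is within max_mismatches.
    """
    hits = []
    for label, query in queries.items():
        k = len(query)
        for start in range(len(read) - k + 1):
            window = read[start:start + k]
            if hamming(window, query) <= max_mismatches:
                hits.append((label, start, window))
    return hits


# Orientation labels are placeholders, not OrientationType members.
queries = {"forward": "TCGCGAGGC", "reverse_complement": "GCCTCGCGA"}

exact = find_hits("AAATCGCGAGGCTTT", queries)
fuzzy = find_hits("AAATCGCGAGGATTT", queries, max_mismatches=1)
print(exact)
print(fuzzy)
```

The scan costs on the order of read length times barcode length per query, which is acceptable for short barcodes. Indels are not handled by this sketch.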

Statistics Tracking



ExtractionStatistics

 ExtractionStatistics (total_reads:int=0, total_barcode_matches:int=0,
                       matches_by_barcode:Dict[str,int]=<factory>,
                       matches_by_orientation:Dict[str,int]=<factory>,
                       matches_by_category:Dict[str,int]=<factory>)

Statistics collected during barcode extraction.
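The `<factory>` defaults in the signature indicate that each dictionary is created per instance rather than shared. The sketch below mirrors the documented fields and adds one way the counters could be updated. The class name `StatsSketch`, the method `record_match`, and the key strings used in the usage lines are hypothetical and are not taken from the library.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class StatsSketch:
    """Stand-in with the same fields as ExtractionStatistics."""

    total_reads: int = 0
    total_barcode_matches: int = 0
    # default_factory gives every instance its own dictionary.
    matches_by_barcode: Dict[str, int] = field(default_factory=dict)
    matches_by_orientation: Dict[str, int] = field(default_factory=dict)
    matches_by_category: Dict[str, int] = field(default_factory=dict)

    def record_match(self, barcode: str, orientation: str, category: str) -> None:
        """Hypothetical helper: increment the total and all three tallies."""
        self.total_barcode_matches += 1
        for table, key in (
            (self.matches_by_barcode, barcode),
            (self.matches_by_orientation, orientation),
            (self.matches_by_category, category),
        ):
            table[key] = table.get(key, 0) + 1


stats = StatsSketch()
stats.total_reads += 2
# The key strings below are placeholders chosen for this example.
stats.record_match("5prime", "forward", "5prime_forward")
stats.record_match("5prime", "reverse_complement", "5prime_reverse_complement")
print(stats)
```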

File Format Utilities



FileFormat

 FileFormat (value, names=None, module=None, qualname=None, type=None,
             start=1)

Supported file formats.
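The members of `FileFormat` are not listed here, so the following is only a sketch of how a format enumeration might be paired with extension-based detection. The names `FormatSketch` and `guess_format`, the member values, and the recognized extensions are assumptions. BAM and FASTQ are used because they are the formats this document names.

```python
from enum import Enum
from pathlib import Path


class FormatSketch(Enum):
    """Stand-in for FileFormat; members and values are hypothetical."""

    BAM = "bam"
    FASTQ = "fastq"
    UNKNOWN = "unknown"


def guess_format(path: str) -> FormatSketch:
    """Guess the format from the file extension, ignoring a trailing .gz."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes:
        return FormatSketch.UNKNOWN
    ext = suffixes[-1]
    if ext == ".bam":
        return FormatSketch.BAM
    if ext in (".fastq", ".fq"):
        return FormatSketch.FASTQ
    return FormatSketch.UNKNOWN


print(guess_format("reads.bam"))
print(guess_format("sample.R1.fastq.gz"))
print(guess_format("notes.txt"))
```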

Example Usage

# Example of creating barcode configurations
barcode_5prime = BarcodeConfig(
    sequence="TCGCGAGGC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5prime",
    description="5' barcode for phenotyping experiment"
)

barcode_3prime = BarcodeConfig(
    sequence="GGCCGGCCGG",
    location=BarcodeLocationType.THREE_PRIME,
    name="3prime",
    description="3' barcode for phenotyping experiment"
)
# Print information about the barcodes
print(f"5' barcode: {barcode_5prime.sequence}, RC: {barcode_5prime.reverse_complement}")
print(f"3' barcode: {barcode_3prime.sequence}, RC: {barcode_3prime.reverse_complement}")

# Create a basic extraction configuration
config = BarcodeExtractorConfig(
    barcodes=[barcode_5prime, barcode_3prime],
    output_prefix="test_output",
    output_dir="../tests/core_output",
    max_mismatches=0,
    verbose=True
)

# Save the configuration to YAML
config.save_yaml("../tests/core_output/test_config.yaml")
print("Configuration saved to test_config.yaml")

# Load the configuration back
loaded_config = BarcodeExtractorConfig.load_yaml("../tests/core_output/test_config.yaml")
print(f"Loaded {len(loaded_config.barcodes)} barcodes from config file")

5' barcode: TCGCGAGGC, RC: GCCTCGCGA
3' barcode: GGCCGGCCGG, RC: CCGGCCGGCC
Configuration saved to test_config.yaml
Loaded 2 barcodes from config file

Basic Utility Functions

Let’s add some basic utility functions for common operations.
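A minimal sketch of two such helpers follows: one computes a reverse complement and one checks that a sequence uses only DNA letters. The function names `reverse_complement` and `is_valid_dna`, the upper-casing behaviour, and the acceptance of `N` are assumptions. The only related name in this document is the `reverse_complement` attribute of `BarcodeConfig` used in the example.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return sequence.upper().translate(_COMPLEMENT)[::-1]


def is_valid_dna(sequence: str) -> bool:
    """True when the sequence is non-empty and contains only A, C, G, T, or N."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGTN")


print(reverse_complement("TCGCGAGGC"))
print(is_valid_dna("TCGCGAGGC"), is_valid_dna("TCGXGA"))
```

Applied to the two barcodes from the example, this helper gives the same reverse complements as the printed output, `GCCTCGCGA` and `CCGGCCGGCC`.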

Conclusion

This notebook establishes the core data structures for the BarcodeSeqKit library. It provides the components for barcode configuration, matching, and statistics tracking, which are used by the specialized processors for BAM and FASTQ files.