Core data structures for barcode extraction from sequencing data
Introduction
BarcodeSeqKit is a library designed for extracting and processing barcoded sequences from next-generation sequencing data. This notebook contains the core data structures and utility functions for the library.
Enumerations
Let’s define the basic enumerations used throughout the library.
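As an illustration of what such an enumeration might look like, here is a minimal sketch of `BarcodeLocationType`. Only the member names `FIVE_PRIME` and `THREE_PRIME` appear in this notebook; the string values and the `UNKNOWN` member are assumptions made for the example, not the library's confirmed definition.

```python
from enum import Enum


class BarcodeLocationType(Enum):
    """Where a barcode is expected relative to the insert (illustrative sketch)."""

    FIVE_PRIME = "5"    # value is an assumption
    THREE_PRIME = "3"   # value is an assumption
    UNKNOWN = "UNK"     # hypothetical member for barcodes without a known side


# Members can be looked up by attribute or by value
location = BarcodeLocationType.FIVE_PRIME
print(location.name, location.value)
print(BarcodeLocationType("3") is BarcodeLocationType.THREE_PRIME)
```

Using an `Enum` rather than bare strings means a misspelled location fails at lookup time instead of silently producing an unmatched barcode.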
# Example of creating barcode configurations
barcode_5prime = BarcodeConfig(
    sequence="TCGCGAGGC",
    location=BarcodeLocationType.FIVE_PRIME,
    name="5prime",
    description="5' barcode for phenotyping experiment"
)
barcode_3prime = BarcodeConfig(
    sequence="GGCCGGCCGG",
    location=BarcodeLocationType.THREE_PRIME,
    name="3prime",
    description="3' barcode for phenotyping experiment"
)

# Print information about the barcodes
print(f"5' barcode: {barcode_5prime.sequence}, RC: {barcode_5prime.reverse_complement}")
print(f"3' barcode: {barcode_3prime.sequence}, RC: {barcode_3prime.reverse_complement}")

# Create a basic extraction configuration
config = BarcodeExtractorConfig(
    barcodes=[barcode_5prime, barcode_3prime],
    output_prefix="test_output",
    output_dir="../tests/core_output",
    max_mismatches=0,
    verbose=True
)

# Save the configuration to YAML
config.save_yaml("../tests/core_output/test_config.yaml")
print("Configuration saved to test_config.yaml")

# Load the configuration back
loaded_config = BarcodeExtractorConfig.load_yaml("../tests/core_output/test_config.yaml")
print(f"Loaded {len(loaded_config.barcodes)} barcodes from config file")
5' barcode: TCGCGAGGC, RC: GCCTCGCGA
3' barcode: GGCCGGCCGG, RC: CCGGCCGGCC
Configuration saved to test_config.yaml
Loaded 2 barcodes from config file
Basic Utility Functions
Let’s add some basic utility functions for common operations.
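Two operations this notebook relies on are reverse-complementing a barcode (the `RC` values printed earlier) and counting mismatches against a threshold such as `max_mismatches`. The sketch below shows one way these could be written; the function names `reverse_complement` and `count_mismatches` are hypothetical helpers chosen for illustration and are not necessarily the library's own.

```python
# Translation table for DNA bases; N is kept as N, case is preserved
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def count_mismatches(a: str, b: str) -> int:
    """Return the Hamming distance between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(reverse_complement("TCGCGAGGC"))             # GCCTCGCGA
print(reverse_complement("GGCCGGCCGG"))            # CCGGCCGGCC
print(count_mismatches("TCGCGAGGC", "TCGCGAGGA"))  # 1
```

With `max_mismatches=0`, as in the configuration above, a candidate would count as a match only when `count_mismatches` returns 0.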
Conclusion
This notebook establishes the core data structures for the BarcodeSeqKit library. It provides the components needed for barcode configuration, matching, and statistics tracking, which will be used by the specialized processors for BAM and FASTQ files.