Computational biology workflows increasingly rely on sophisticated statistical approaches, machine learning (ML) techniques and generative artificial intelligence (AI) agents to handle high-throughput datasets. As the role of AI in bioinformatics grows, however, these methods introduce analytical pitfalls that can undermine results, misguide interpretations and erode trust in translational outcomes.
This webinar highlights three common but often overlooked pitfalls in bioinformatics workflows, specifically focusing on compositional data handling, interpretation of ML-derived feature importance and effective use of AI agents for ranking tasks.
Pitfall 1: Mismanaging Compositional Data (RNA-seq and Beyond)
RNA sequencing generates compositional data — where the values for each sample represent parts of a whole and always add up to a fixed total. A common mistake in analyzing this type of data is filtering out genes with low counts or low variability. While this may seem helpful, it disrupts the balance of the dataset (known as compositional closure) and can lead to misleading statistical results.
In this webinar, the speaker will show how including a residual category — a placeholder for the filtered-out genes — helps maintain the integrity of the data and avoids skewing the analysis.
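As a rough illustration of the residual-category idea, the sketch below filters low-count genes but pools their counts into a single placeholder, so each sample's total is unchanged. The function name `filter_with_residual`, the `residual` label, the `min_total` threshold and the toy counts are all assumptions made for this example, not the speaker's implementation.

```python
RESIDUAL = "residual"  # hypothetical label for the pooled, filtered-out genes

def filter_with_residual(samples, min_total=10):
    """Drop low-count genes but pool their counts into one residual category.

    samples: list of dicts mapping gene name -> raw count (one dict per sample).
    Returns new dicts whose per-sample totals match the input totals.
    """
    genes = sorted({g for s in samples for g in s})
    totals = {g: sum(s.get(g, 0) for s in samples) for g in genes}
    kept = {g for g in genes if totals[g] >= min_total}
    filtered = []
    for s in samples:
        row = {g: s.get(g, 0) for g in sorted(kept)}
        # Everything that was filtered out lands in a single placeholder.
        row[RESIDUAL] = sum(c for g, c in s.items() if g not in kept)
        filtered.append(row)
    return filtered

# Made-up counts for two samples; gene_C and gene_D fall below the threshold.
samples = [
    {"gene_A": 500, "gene_B": 300, "gene_C": 2, "gene_D": 1},
    {"gene_A": 400, "gene_B": 590, "gene_C": 0, "gene_D": 3},
]
filtered = filter_with_residual(samples)
totals_before = [sum(s.values()) for s in samples]
totals_after = [sum(s.values()) for s in filtered]
print(totals_before, totals_after)
```

Because the per-sample totals are the same before and after filtering, proportions computed on the filtered table still refer to the same whole.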
Another issue arises when dealing with zero values. Many standard approaches add small pseudocounts to manage these zeros, but this can introduce bias. A better solution is the PFLog1PF transformation, a method that handles zeros more reliably and supports accurate downstream analyses like principal component analysis (PCA) and clustering.
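The sketch below shows one possible reading of the transformation, assuming the name stands for proportional fitting, then log1p, then a second proportional fitting. That reading, the function names and the choice of the mean sample total as the scaling target are assumptions for illustration only.

```python
import math

def proportional_fit(matrix, target=None):
    """Rescale each row (sample) so that all rows share the same total."""
    sums = [sum(row) for row in matrix]
    if target is None:
        target = sum(sums) / len(sums)  # assumed target: mean total across samples
    return [
        [v * target / s if s else 0.0 for v in row]
        for row, s in zip(matrix, sums)
    ]

def pf_log1p_pf(matrix):
    """Proportional fitting, log1p, then proportional fitting again."""
    fitted = proportional_fit(matrix)
    logged = [[math.log1p(v) for v in row] for row in fitted]
    return proportional_fit(logged)

# Two made-up samples with identical composition but a tenfold depth difference.
counts = [[10, 0, 90], [1, 0, 9]]
transformed = pf_log1p_pf(counts)
for row in transformed:
    print([round(v, 4) for v in row])
```

In this sketch zeros stay at zero (log1p of 0 is 0), and the two samples end up with identical rows because they differ only in depth.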
These pitfalls matter not only for RNA-seq but also for any field that relies on compositional data, including microbiome profiling, dietary intake data, flow cytometry and competitive market share.
Pitfall 2: Overinterpreting ML Feature Importance
ML methods often generate feature importance scores to highlight which variables — such as genes, proteins or clinical markers — most influence their predictions. These scores are frequently used as stand-ins for biological or clinical significance, but this can be problematic. The values can vary widely depending on the specific algorithm or even the software implementation used (for example, random forests in Python vs. R), and they typically don’t provide any built-in measure of uncertainty.
This webinar will reframe feature importance as a statistical measurement, one that is inherently variable and should be interpreted with caution. The speaker will walk through practical examples showing how these scores can be evaluated more rigorously using statistical techniques. By applying methods like bootstrapped permutation testing, researchers can better judge whether the importance of a given feature reflects a real biological signal or arises from random noise.
Participants will gain a reusable analytical framework for rigorous interpretation of feature importance in their workflows.
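To make the idea of importance as a measurement with uncertainty concrete, here is a minimal sketch of one way to combine bootstrapping with permutation importance. It is not the speaker's framework. A simple nearest-centroid classifier stands in for a real ML model to keep the example dependency-free, the data are synthetic (feature 0 carries signal, feature 1 is pure noise), importance is evaluated on the resampled data itself, and all function names are hypothetical.

```python
import random

def fit_centroids(X, y):
    """Nearest-centroid classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def accuracy(centroids, X, y):
    """Fraction of rows whose nearest centroid carries the true label."""
    def predict(x):
        return min(centroids,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
    return sum(predict(x) == t for x, t in zip(X, y)) / len(y)

def permutation_importance(centroids, X, y, j, rng):
    """Drop in accuracy after shuffling feature column j."""
    column = [x[j] for x in X]
    rng.shuffle(column)
    X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
    return accuracy(centroids, X, y) - accuracy(centroids, X_perm, y)

def bootstrap_importance(X, y, n_boot=200, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of the drop per feature."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    draws = [[] for _ in range(p)]
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # skip degenerate resamples that contain a single class
        centroids = fit_centroids(Xb, yb)
        for j in range(p):
            draws[j].append(permutation_importance(centroids, Xb, yb, j, rng))
    summary = []
    for d in draws:
        d.sort()
        lo = d[int(0.025 * (len(d) - 1))]
        hi = d[int(0.975 * (len(d) - 1))]
        summary.append((sum(d) / len(d), lo, hi))
    return summary

# Synthetic data: feature 0 separates the classes, feature 1 is noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(100)]
X = [[3.0 * label + data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for label in y]

summary = bootstrap_importance(X, y)
for j, (mean, lo, hi) in enumerate(summary):
    print(f"feature {j}: mean drop {mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

An interval that sits well above zero suggests the feature's importance is unlikely to be noise, while an interval hugging zero suggests the score should not be read as signal. Evaluating on held-out or out-of-bag rows instead of the resampled rows would be a stricter variant.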
Pitfall 3: Misapplying Generative AI for Ranking and Prioritization
Generative AI tools, including large language models (LLMs), are increasingly used to help rank biological entities like genes, biomarkers or drug candidates. While these models offer a fast and convenient way to generate rankings, relying on them as infallible can lead to results that are inconsistent, biased or difficult to reproduce.
This webinar will introduce a more reliable way to use generative AI in prioritization tasks. The speaker will show how to integrate generative AI into controlled ranking processes using pairwise comparisons — where items are evaluated two at a time — and statistical models like the Bradley-Terry method. This allows researchers to generate rankings that are not only reproducible but also come with clear measures of confidence.
By integrating generative AI within a robust analytical framework, this approach enhances reproducibility and trust in AI-assisted decision-making across bioinformatics workflows.
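The pairwise approach can be sketched as follows, assuming the AI's judgments have already been collected as (winner, loser) pairs. The item names and comparison counts are made up for illustration, `fit_bradley_terry` is a hypothetical helper, and the fit uses a standard iterative (minorization-maximization) update for the Bradley-Terry model, not any specific library API.

```python
def fit_bradley_terry(items, comparisons, n_iter=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict item -> strength, normalized to sum to 1. Assumes every item
    wins at least once and the comparison graph is connected.
    """
    wins = {i: 0 for i in items}
    pair_counts = {}
    for winner, loser in comparisons:
        wins[winner] += 1
        key = frozenset((winner, loser))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = {i: 1.0 / len(items) for i in items}
    for _ in range(n_iter):
        new_p = {}
        for i in items:
            # Sum over opponents j of (number of i-vs-j comparisons) / (p_i + p_j).
            denom = sum(
                n / (p[i] + p[j])
                for key, n in pair_counts.items() if i in key
                for j in key - {i}
            )
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {i: v / total for i, v in new_p.items()}
    return p

# Made-up outcomes of repeated pairwise judgments (10 per pair).
items = ["gene_A", "gene_B", "gene_C"]
comparisons = (
    [("gene_A", "gene_B")] * 8 + [("gene_B", "gene_A")] * 2
    + [("gene_A", "gene_C")] * 9 + [("gene_C", "gene_A")] * 1
    + [("gene_B", "gene_C")] * 7 + [("gene_C", "gene_B")] * 3
)

strengths = fit_bradley_terry(items, comparisons)
ranking = sorted(items, key=strengths.get, reverse=True)
print(ranking)
print({i: round(s, 3) for i, s in strengths.items()})
```

Under this model the estimated probability that item i is preferred over item j is p_i / (p_i + p_j). Resampling the comparisons and refitting would be one way to attach uncertainty to the resulting ranks.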
Register for this webinar to learn strategies and methods to detect, understand and circumvent these hidden pitfalls of AI in bioinformatics. The session emphasizes reproducible, trustworthy bioinformatics practices that significantly enhance confidence in analytical results, supporting critical decisions throughout the drug discovery and clinical development pipelines.
Speaker

Juan Felipe Beltrán, Director of AI, Machine Learning and Innovation, BullFrog AI
Dr. Juan Felipe Beltrán is an accomplished scientist and software engineer working in algorithm development, ML and bioinformatics. Dr. Beltrán has a proven track record in designing and implementing innovative solutions for complex biological data analysis, advancing proteomics/genomics research and contributing to the fight against global health challenges like COVID-19. Dr. Beltrán is adept at interdisciplinary collaboration, project management and data visualization, with a passion for using computational approaches to improve human health and advance scientific understanding.
Who Should Attend?
This webinar will benefit biotech and pharma professionals with the following titles:
- Bioinformaticians and Biostatisticians
- Data Scientists and ML Scientists
- Translational Scientists
- Computational Biologists
What You Will Learn
Attendees will learn how to:
- Maintain statistical integrity when analyzing compositional biological data by avoiding common preprocessing mistakes
- Interpret ML feature importance scores using bootstrapped permutation testing to distinguish meaningful biological signals from noise and quantify uncertainty
- Effectively integrate generative AI tools into ranking and prioritization workflows by applying structured methodologies to ensure reproducibility and statistical robustness
Xtalks Partner
BullFrog AI
BullFrog AI leverages artificial intelligence and machine learning to advance drug discovery and development. Through collaborations with leading research institutions, BullFrog AI uses causal AI in combination with its proprietary bfLEAP® platform to analyze complex biological data, aiming to streamline therapeutics development and reduce failure rates in clinical trials.