AI Advances in DNA: ‘genomic sequences as structured language’

By Jim Shimabukuro (assisted by Copilot)
Editor

Overview: Work at the intersection of artificial intelligence and DNA now spans fundamental genomics, genome editing, clinical translation, and ethics, and a small set of authors recurs across the most influential recent contributions. Anshul Kundaje and collaborators such as Katherine S. Pollard and Jian Ma are central voices on using deep learning to decode regulatory DNA, articulating both technical advances and conceptual roadmaps for AI in molecular biology.5 Chong Wu and Peng Wei have emerged as leading figures in DNA foundation language models, benchmarking and comparing architectures that treat DNA as a “language” and setting standards for how such models should be evaluated and selected for real genomic tasks.6,10

Image created by Copilot

On the modeling side, Veniamin Fishman and Mikhail Burtsev, together with colleagues, have pushed long-context DNA language models (GENA‑LM), while Xiang Zhang and co‑authors (DeepGene) and Zehui Li and co‑authors (Omni‑DNA) are shaping the next generation of efficient, cross‑modal genomic foundation models.7-9 In parallel, Shriniket Dixit and co‑authors have become key reference points for AI‑enhanced CRISPR genome editing, and Radha Nagarajan and Nephi Walton are prominent in framing how AI in genomics moves into clinical and laboratory practice.2,3 Finally, Harry Farmer and colleagues at the Ada Lovelace Institute stand out for mapping the ethical, legal, and societal terrain of AI in genomics, complementing the technical literature with governance‑oriented analysis.4

From this landscape, three cutting‑edge subtopics stand out as especially important: DNA foundation language models for genomics; AI‑guided genome editing and CRISPR design; and AI‑driven clinical genomics and ethical governance. Each captures a different layer of how AI and DNA are being woven together—from sequence representation, to precise editing, to real‑world deployment and oversight.

DNA foundation language models for genomics

DNA foundation language models treat genomic sequences as a structured “language,” using transformer and related architectures to learn representations that can be reused across many downstream tasks, from variant effect prediction to gene expression modeling. This subtopic is cutting‑edge because it promises a unifying substrate for genomic analysis, analogous to how large language models transformed natural language processing. Haonan Feng, Chong Wu, Peng Wei, and collaborators have produced one of the most comprehensive benchmarks of DNA foundation models to date, systematically comparing models such as DNABERT‑2, Nucleotide Transformer V2, HyenaDNA, Caduceus‑Ph, and GROVER across diverse genomic and genetic tasks.6
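The “language” framing is most concrete at the tokenization step: before a transformer sees a genome, the raw sequence is split into tokens, classically overlapping k-mers (as in the original DNABERT), while newer models such as DNABERT‑2 learn a byte-pair-encoding vocabulary instead. A minimal sketch of k-mer tokenization, for illustration only:

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA sequence into overlapping k-mer "words" (DNABERT-style)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

print(kmer_tokenize("ACGTACGTAC"))
# 5 overlapping 6-mers: ['ACGTAC', 'CGTACG', 'GTACGT', 'TACGTA', 'ACGTAC']
```

Each k-mer becomes one vocabulary item; the stride controls how much neighboring tokens overlap, and a stride equal to k yields non-overlapping tokens.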

Their work is notable not only for its breadth of evaluation but also for its methodological clarity: they show, for example, that embedding‑pooling choices such as mean token embedding (averaging per‑token representations into a single sequence vector) can shift performance as much as changing the model itself, and they explicitly connect model design decisions to practical outcomes in tasks like pathogenic variant identification and gene expression prediction.6,10 This combination of rigorous benchmarking and clinically relevant framing makes Wu and colleagues central authors in this subtopic.
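The pooling point can be made concrete: a DNA language model emits one embedding per token, but a downstream classifier needs a single vector per sequence. Mean pooling averages the token vectors; a common alternative is to take a designated first token. A toy sketch in pure Python, with hand-written vectors standing in for real model outputs:

```python
def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Average per-token embeddings into one sequence-level vector."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[d] for vec in token_embeddings) / n for d in range(dim)]

def first_token_pool(token_embeddings: list[list[float]]) -> list[float]:
    """Use the first ([CLS]-style) token embedding as the sequence vector."""
    return token_embeddings[0]

toks = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
print(mean_pool(toks))         # [3.0, 2.0]
print(first_token_pool(toks))  # [1.0, 2.0]
```

The two strategies hand very different vectors to the downstream task, which is why the benchmark treats pooling as a design decision in its own right.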

A second prominent cluster of authors is led by Veniamin Fishman and Mikhail Burtsev, whose GENA‑LM family of transformer‑based DNA language models focuses on long sequences—up to 36,000 base pairs—and introduces recurrent memory mechanisms to extend context even further.8 Their contribution is important because many regulatory and structural genomic phenomena depend on long‑range interactions, and GENA‑LM explicitly tackles the challenge of capturing rich contextual information dispersed across thousands of nucleotides. By releasing multispecies and taxon‑specific models, along with open‑source code and a web service for DNA annotation, they also embody an open, infrastructure‑building approach that accelerates adoption by the wider community.8
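One way to picture the recurrent-memory idea is a loop that processes a long sequence segment by segment while threading a small state forward, so later segments can condition on earlier context. The sketch below is a toy analogue only, with a running GC-content summary standing in for a learned memory, not GENA‑LM's actual mechanism:

```python
def process_with_memory(seq, segment_len, encode):
    """Walk a long sequence in fixed-size segments, carrying a memory
    state from each segment to the next (recurrent-memory style)."""
    memory, outputs = None, []
    for start in range(0, len(seq), segment_len):
        memory, out = encode(seq[start:start + segment_len], memory)
        outputs.append(out)
    return outputs

def gc_encoder(segment, memory):
    """Stub 'encoder': memory is (GC count, total bases) seen so far;
    each segment's output is the cumulative GC content."""
    gc, total = memory or (0, 0)
    gc += sum(base in "GC" for base in segment)
    total += len(segment)
    return (gc, total), gc / total

outs = process_with_memory("GGGGAAAAGGCC", 4, gc_encoder)
print([round(x, 2) for x in outs])  # [1.0, 0.5, 0.67]
```

The key property is that the third segment's output depends on everything seen before it, even though each call only receives one short segment, which is the same trick that lets a fixed-context transformer reason over much longer inputs.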

A third key author group in this subtopic is Xiang Zhang and co‑authors, who introduced DeepGene, an efficient foundation model for genomics based on a pan‑genome graph transformer.9 DeepGene is distinctive in how it addresses three persistent challenges in DNA language modeling: genetic language diversity across individuals and populations, model efficiency at scale, and length extrapolation from short to long sequences.9 By leveraging pan‑genome and minigraph representations and demonstrating top performance on a broad Genome Understanding Evaluation benchmark, Zhang and colleagues show that foundation models can be both compact and state‑of‑the‑art, which is crucial for making these tools usable beyond a handful of well‑resourced centers.9 Together, the work of Wu, Fishman, Burtsev, Zhang, and their collaborators defines the current frontier of DNA foundation language models, shaping how AI “reads” DNA at scale.

AI‑guided genome editing and CRISPR design

AI‑guided genome editing focuses on using machine learning to design and optimize CRISPR‑based interventions, including guide RNA selection, off‑target prediction, and the tuning of base, prime, and epigenome editing systems. This subtopic is cutting‑edge because it directly links AI to the ability to rewrite DNA with increasing precision, raising both therapeutic possibilities and safety concerns. Shriniket Dixit, Anant Kumar, Kathiravan Srinivasan, and co‑authors have written a comprehensive review on advancing genome editing with AI, which has quickly become a touchstone for the field.2

They synthesize how tools like DeepCRISPR, CRISTA, and DeepHF use deep learning to predict optimal guide RNAs, incorporating genomic context, Cas protein type, desired mutation, and on‑/off‑target scores, and they extend the discussion to advanced modalities such as base and prime editing.2 Their work stands out because it bridges algorithmic details with concrete disease applications—such as sickle cell anemia and thalassemia—and explicitly frames AI as a way to improve precision, efficiency, and affordability in genome editing.2
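The guide-design step these tools automate starts from a simple structural constraint: for SpCas9, a candidate guide is a 20-nt protospacer sitting immediately upstream of an NGG PAM. A minimal forward-strand enumerator is sketched below; the GC heuristic is a placeholder for illustration, since tools like DeepCRISPR rank candidates with trained deep models instead:

```python
import re

def candidate_guides(seq: str, guide_len: int = 20) -> list[str]:
    """Enumerate forward-strand protospacers immediately 5' of an NGG PAM."""
    guides = []
    # lookahead so overlapping PAM occurrences are all found
    for m in re.finditer(r"(?=([ACGT]GG))", seq):
        pam_start = m.start(1)
        if pam_start >= guide_len:
            guides.append(seq[pam_start - guide_len:pam_start])
    return guides

def gc_score(guide: str) -> float:
    """Placeholder ranking heuristic; real tools use trained models."""
    return sum(base in "GC" for base in guide) / len(guide)

seq = "ATGCATGCATGCATGCATGCTGGAAA"
for g in sorted(candidate_guides(seq), key=gc_score, reverse=True):
    print(g, round(gc_score(g), 2))
```

A production pipeline would also scan the reverse complement and score each candidate against the whole genome for off-target sites, which is exactly where the deep-learning predictors earn their keep.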

Dixit and colleagues are also notable for their clear articulation of the remaining challenges: off‑target editing, delivery methods for CRISPR cargo, editing efficiency, and safety in clinical applications.2 By treating AI not as a magic solution but as a set of tools embedded in a broader experimental and clinical pipeline, they provide a realistic roadmap for how AI‑driven genome editing might mature. This balanced perspective is a key reason to regard them as prominent authors in this subtopic. Complementing this, broader reviews of AI and machine learning in biology, such as the 2025 MDPI article by Zaw Myo Hein and co‑authors, situate CRISPR design within a continuum from gene function prediction to protein structure modeling, highlighting how transformer architectures and large language models are reshaping tasks from regulatory element detection to novel protein and drug design.1

These works collectively underscore why AI‑guided genome editing is a major frontier: it compresses the design–build–test cycle for genetic interventions, enables more systematic exploration of the CRISPR design space, and raises the stakes for governance because errors or biases in models can translate into real biological consequences. Authors like Dixit, Kumar, and Srinivasan are central because they integrate technical, biological, and translational perspectives, making their analyses indispensable for anyone trying to understand how AI and DNA converge in the editing lab.

AI‑driven clinical genomics and ethical governance

The third major subtopic concerns how AI‑enabled genomic analysis moves into clinical practice and how its ethical, legal, and societal implications are understood and governed. This area is cutting‑edge because the bottleneck is no longer only algorithmic performance; it is also about integrating AI into healthcare systems, ensuring fairness and privacy, and managing the societal impact of AI‑genomics products. Radha Nagarajan, Chen Wang, Derek Walton, and Nephi Walton have authored a 2024 review on AI applications in genomics that focuses explicitly on clinical and laboratory settings.3

They emphasize how AI and machine learning are becoming essential as genomic data volumes grow, enabling new insights into disease genetics and supporting personalized medical care through tools like polygenic risk scores and automated variant interpretation.3 Their work is particularly important because it details how AI models are already routinely used in laboratories, and it identifies practical challenges—data privacy, model interpretability, data availability in clinical settings, and regulatory hurdles—that must be addressed for safe, widespread deployment.3
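A polygenic risk score itself is simple arithmetic, a weighted sum of effect-allele dosages across variants (PRS = Σ βᵢ·gᵢ with gᵢ ∈ {0, 1, 2}); the hard parts are estimating the weights and validating them across ancestries. The sketch below, with hypothetical variant IDs and effect sizes, shows only the scoring step:

```python
def polygenic_risk_score(dosages: dict[str, int],
                         weights: dict[str, float]) -> float:
    """PRS = sum of (effect size * effect-allele dosage) over scored variants.
    Variants absent from the weight table are skipped."""
    return sum(weights[v] * g for v, g in dosages.items() if v in weights)

# hypothetical variants and effect sizes
weights = {"rs0001": 0.30, "rs0002": -0.10}
dosages = {"rs0001": 2, "rs0002": 1, "rs9999": 2}  # rs9999 has no weight
print(round(polygenic_risk_score(dosages, weights), 4))
```

Even this toy makes the deployment issues visible: the score is only as good as the weight table, and weights estimated in one population may transfer poorly to another.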

On the governance and societal side, Harry Farmer and colleagues at the Ada Lovelace Institute have produced the “DNA.I” report, which surveys AI‑powered developments in genomics and maps the legal, ethical, and societal challenges they pose.4 This report is notable for its breadth: it connects national genomic initiatives, such as the UK’s ambition to become a leading genomic healthcare system, with debates over CRISPR ethics and the implications of foundation models trained on genomic data.4 Farmer and co‑authors analyze trends and predictions for the next five to ten years, highlighting issues like data governance, transparency, and the economic forces driving AI‑genomics integration.4 Their work is a key reference because it treats AI and genomics as intertwined socio‑technical systems rather than isolated technologies, offering a framework for policymakers, regulators, and researchers to think about responsible development.

Finally, voices like Kundaje, Pollard, and Ma, writing in Molecular Cell on “artificial intelligence in molecular biology,” help connect the clinical and ethical discussions back to the underlying science of regulatory DNA and variant interpretation.5 By emphasizing the need for interpretability, robust evaluation, and careful consideration of how models handle non‑coding variation and cis‑regulatory logic, they implicitly shape what “responsible” AI‑driven genomics should look like at the molecular level.5 Taken together, authors such as Nagarajan, Walton, Farmer, Kundaje, and their collaborators are prominent in this subtopic because they articulate how AI‑DNA technologies move from bench to bedside and into society, and they foreground the constraints and values that must guide that transition.

References

  1. Hein ZM et al., “AI and Machine Learning in Biology: From Genes to Proteins,” Biology, 2025. https://www.mdpi.com/2079-7737/14/10/1453
  2. Dixit S et al., “Advancing genome editing with artificial intelligence: opportunities, challenges, and future directions,” Frontiers in Bioengineering and Biotechnology, 2024. https://www.frontiersin.org/articles/10.3389/fbioe.2023.1335901
  3. Nagarajan R et al., “Artificial Intelligence Applications in Genomics,” Advances in Molecular Pathology, 2024. https://www.sciencedirect.com/science/article/pii/S2589933324000360
  4. Farmer H et al., “DNA.I: Early findings and emerging questions on the use of AI in genomics,” Ada Lovelace Institute, 2023. https://www.adalovelaceinstitute.org/report/dna-i
  5. Kundaje A et al., “Artificial intelligence in molecular biology,” Molecular Cell, 2025. https://www.cell.com/molecular-cell/fulltext/S1097-2765(24)00876-0
  6. Feng H et al., “Benchmarking DNA foundation models for genomic and genetic tasks,” Nature Communications, 2025. https://www.nature.com/articles/s41467-025-65823-8
  7. Li Z et al., “Omni-DNA: A Unified Genomic Foundation Model for Cross-Modal and Multi-Task Learning,” arXiv, 2025. https://arxiv.org/abs/2502.03499
  8. Fishman V et al., “GENA-LM: a family of open-source foundational DNA language models for long sequences,” Nucleic Acids Research, 2025. https://academic.oup.com/nar/article/53/2/gkae1310/7958870
  9. Zhang X et al., “DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer,” bioRxiv, 2024. https://www.biorxiv.org/content/10.1101/2024.04.24.590879v1
  10. “Comparing DNA language models to guide optimal AI selection for genomics,” University of Texas MD Anderson Cancer Center, Medical Xpress, 2025. https://medicalxpress.com/news/2025-12-dna-language-optimal-ai-genomics.html

Addendum

ChatGPT

Among the most prominent figures in genomic foundation models is Patrick Hsu, whose work at the Arc Institute has helped define the idea of “DNA foundation models” capable of modeling biological sequences across scales. His group’s Evo model, for example, applies deep learning to interpret and design DNA sequences at single-nucleotide resolution, effectively positioning DNA as a computable substrate analogous to text in NLP systems [Wikipedia]. Closely related are contributors like Eric Xing and Ziv Bar-Joseph (through GenBio AI), who are building “digital organism” systems that simulate DNA, RNA, and cellular processes simultaneously, signaling a move toward unified biological AI platforms [Wikipedia]. A landmark technical milestone in this domain is DeepMind’s AlphaGenome, which models long DNA sequences and predicts gene regulation and mutation effects across multiple biological modalities [Wikipedia]. These authors and systems are selected because they are not merely applying AI to biology—they are redefining biology itself as an information science, with DNA as a sequence to be learned, predicted, and generated.

AI-driven gene editing and generative biology pushes beyond understanding DNA into actively designing it. Again, Patrick Hsu is central, particularly through his work on AI-designed gene-editing systems and programmable recombination techniques that can manipulate massive genomic regions with precision [Wikipedia]. Researchers like Lars Velten and colleagues at the Center for Genomic Regulation represent another critical strand, demonstrating that AI can generate entirely novel DNA regulatory sequences tailored to specific cellular behaviors—essentially “writing code” for cells [phys.org]. Institutions such as the Wellcome Sanger Institute, through contributors synthesizing ongoing research, highlight the broader movement toward “generative biology,” where AI is used to predict and engineer DNA and proteins as designable artifacts [Sanger]. These figures are selected because they exemplify a shift from descriptive biology to synthetic, programmable biology, where AI enables researchers to create biological functions that do not exist in nature. This transition mirrors the evolution of computing from reading data to generating new content, suggesting that biology itself is becoming an engineering discipline.

AI in clinical genomics and disease prediction focuses on the application of AI to interpret DNA in medical contexts, particularly for diagnosis, prognosis, and personalized medicine. Here, figures like Olivier Elemento stand out for integrating machine learning with large-scale genomic and clinical datasets to advance precision medicine, especially in oncology [Wikipedia]. Parallel efforts at institutions such as Harvard Medical School have produced models like popEVE, which predict whether specific genetic variants are disease-causing, addressing one of the central bottlenecks in genomic medicine [Harvard]. Researchers like Raúl Rabadán further extend this work by using AI to predict gene activity across cell types, linking genomic variation to functional biological outcomes [Columbia]. These authors are selected because they are translating AI-DNA integration into clinical impact, helping move genomics from a descriptive science to a predictive and actionable one. Their work also highlights ongoing challenges, including data bias, interpretability, and the difficulty of integrating heterogeneous biological and clinical data [ScienceDirect].
