2.3 Integrating External Databases

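This section shows how LLMs can be integrated into research workflows by combining them with external scientific databases. The examples retrieve records from UniProt and the RCSB Protein Data Bank and then ask an LLM to summarize and interpret them, which takes routine tasks such as reading through dense database entries off the researcher's hands.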

🚀 How to run the notebook

This tutorial can be launched using the rocket button at the top of the page.

Option 1 — Google Colab (recommended)#

Opens the notebook in Google Colab with the fastest and most reliable experience.

Before running the tutorial, add your API keys using either:

a .env file, or
Colab Secrets (🔑 Secrets tab in the left sidebar)

Example .env:

OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here

Option 2 — MyBinder#

Launches a temporary cloud Jupyter environment directly in your browser.

⚠️ Binder environments can take a few minutes to build and start.

After the notebook loads, create a .env file in the notebook directory containing your API keys:

OPENAI_API_KEY=your_key_here
ANTHROPIC_API_KEY=your_key_here

Notes#

You only need API keys for the providers used in a given notebook.
Never commit or publicly share your API keys.
If a cell fails due to missing credentials, verify that your keys were loaded correctly before rerunning the cell.
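
One way to verify that keys were loaded, without printing the secrets themselves, is to report only whether each variable is set. This is a minimal sketch: the helper name `check_api_keys` is hypothetical, and it inspects environment variables only, so it assumes the keys come from a `.env` file or the shell rather than from Colab Secrets.

```python
import os

def check_api_keys(names=("OPENAI_API_KEY", "ANTHROPIC_API_KEY"), env=None):
    """Return {variable name: True/False} without exposing the key values."""
    env = os.environ if env is None else env
    # A key counts as present only if it is set and not blank
    return {name: bool(env.get(name, "").strip()) for name in names}

# Report the status of each key; the values themselves are never printed
for name, present in check_api_keys().items():
    print(f"{name}: {'set' if present else 'missing'}")
```

If you use a `.env` file, call `load_dotenv()` from `python-dotenv` before running the check.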

2.3.1 Example 1: UniProt integration#

Think of a scenario where a biologist wants to learn about the protein p53 (a critical tumor suppressor) – specifically its function, domains, and disease associations, without manually reading through dense database entries.

In this example we will use Anthropic’s claude-haiku-4-5-20251001 model, so you will need to set up an Anthropic API key before starting. Alternatively, you can use an OpenAI model, as shown later in this example. See Chapter 2.2.

Step 1: Query the UniProt REST API#

UniProt offers a free, no-authentication REST API. You can fetch data for p53 (human) using its UniProt accession ID P04637.

If you’re using Google Colab, install the requirements by running the cell below.

# Install the necessary packages (openai is only needed for the OpenAI variant in Step 2)
!uv pip install requests anthropic openai py3Dmol python-dotenv

import requests

# Fetch p53 human protein entry from UniProt
accession = "P04637"
url = f"https://rest.uniprot.org/uniprotkb/{accession}.json"

response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early on HTTP errors (e.g. an unknown accession)
data = response.json()

# Extract key fields
protein_name = data['proteinDescription']['recommendedName']['fullName']['value']
organism = data['organism']['scientificName']
function_text = next(
    c['texts'][0]['value']
    for c in data['comments']
    if c['commentType'] == 'FUNCTION'
)
sequence = data['sequence']['value']

print(f"Protein: {protein_name}")
print(f"Organism: {organism}")
print(f"Function: {function_text[:300]}...")
print(f"Sequence length: {data['sequence']['length']} aa")
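
The `next(...)` expression above raises `StopIteration` when an entry has no FUNCTION comment, which can happen for sparsely annotated proteins. A more defensive lookup could look like the sketch below. The helper name `get_function_text` and the toy entry are illustrative and are not part of UniProt's API or data.

```python
def get_function_text(entry, default="No FUNCTION annotation found"):
    """Return the first FUNCTION comment text from a UniProt JSON entry."""
    for comment in entry.get("comments", []):
        if comment.get("commentType") == "FUNCTION":
            texts = comment.get("texts", [])
            if texts:
                return texts[0].get("value", default)
    # No FUNCTION comment (or an empty one): fall back to the default
    return default

# Small hand-made entry (not real UniProt data) to show the behaviour
toy_entry = {
    "comments": [
        {"commentType": "FUNCTION", "texts": [{"value": "Binds DNA."}]}
    ]
}
print(get_function_text(toy_entry))  # prints "Binds DNA."
print(get_function_text({}))         # prints the default message
```

With this helper, `function_text = get_function_text(data)` can replace the `next(...)` call.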

Step 2: Pass the Data to an LLM#

Now feed the retrieved data into an LLM (e.g., Claude via the API) with a biologically meaningful prompt.

import os

LLM_API_KEYS = {
    "openai":    "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def get_api_key(llm: str = "anthropic") -> str:
    """
    Load API key for the specified LLM from Colab secrets
    or from an environment variable / .env file.
    
    Args:
        llm: LLM provider name, e.g. 'openai' or 'anthropic'
    
    Returns:
        API key string
    
    Example:
        api_key = get_api_key("anthropic")
    """

    llm = llm.lower()
    if llm not in LLM_API_KEYS:
        raise ValueError(
            f"Unknown LLM '{llm}'. Choose from: {list(LLM_API_KEYS.keys())}"
        )

    env_var = LLM_API_KEYS[llm]

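    # 1. Try Colab secrets
    try:
        from google.colab import userdata
        key = userdata.get(env_var)
        if key:
            return key
    except Exception:
        # Not running in Colab, secret not defined, or notebook access
        # not granted: fall through to the environment variable
        pass

    # 2. Try environment variable / .env file
    try:
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # python-dotenv not installed; plain environment variables still work
    key = os.environ.get(env_var)
    if key:
        return key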

    raise ValueError(
        f"API key not found. Please set {env_var}:\n"
        f"  export {env_var}='your-key-here'\n"
        f"  or add it to a .env file"
    )

import anthropic

api_key = get_api_key(llm="anthropic")
client = anthropic.Anthropic(api_key=api_key)

# Build a prompt using the retrieved UniProt data
prompt = f"""
I retrieved the following information about a protein from UniProt:

Protein name: {protein_name}
Organism: {organism}
Function annotation: {function_text}
Sequence length: {data['sequence']['length']} amino acids

Please do the following:
1. Summarize what this protein does in plain language suitable for an undergraduate student.
2. Explain why this protein is clinically important.
3. Suggest 2–3 follow-up questions a researcher might want to investigate.
"""

message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,  
    messages=[{"role": "user", "content": prompt}]
)

print(message.content[0].text)

If you want to use an OpenAI model instead, you only need to swap the client and load the matching API key. See the example below, or check the previous section on “Extracting information from literature”.

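from openai import OpenAI

# Use separate names so the Anthropic `client` used in later cells is not overwritten
openai_client = OpenAI(api_key=get_api_key(llm="openai"))

openai_response = openai_client.responses.create(
    model="gpt-4.1-mini",
    input=prompt
)

print(openai_response.output_text)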

Step 3: Investigate Disease Variants#

You can pull disease annotations from the same UniProt entry and ask the LLM to interpret them:

# Extract disease associations
diseases = [
    c for c in data['comments']
    if c['commentType'] == 'DISEASE'
]

disease_summary = "\n".join([
    d.get('disease', {}).get('description', '') for d in diseases
])

variant_prompt = f"""
The following diseases are associated with mutations in {protein_name}:

{disease_summary}

What do these associations tell us about the protein's role in cancer biology? 
What types of mutations (gain-of-function vs loss-of-function) are typically seen?
"""

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": variant_prompt}]
)

print(response.content[0].text)
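
The disease summary above contains only descriptions, and a DISEASE comment without a `disease` block contributes an empty line. The sketch below builds one line per disease that includes its name. It assumes the disease name is stored under `diseaseId`, which you should confirm against the JSON you retrieved. The helper name `summarize_diseases` and the toy entry are illustrative.

```python
def summarize_diseases(entry):
    """Build one '- name: description' line per DISEASE comment with a disease block."""
    lines = []
    for comment in entry.get("comments", []):
        if comment.get("commentType") != "DISEASE":
            continue
        disease = comment.get("disease")
        if not disease:
            continue  # comment may carry only a free-text note
        # Assumption: the disease name lives under 'diseaseId'
        name = disease.get("diseaseId", "Unnamed disease")
        description = disease.get("description", "").strip()
        lines.append(f"- {name}: {description}" if description else f"- {name}")
    return "\n".join(lines)

# Hand-made entry (not real UniProt data)
toy_entry = {
    "comments": [
        {"commentType": "DISEASE",
         "disease": {"diseaseId": "Toy syndrome", "description": "A made-up condition."}},
        {"commentType": "DISEASE", "note": {"texts": [{"value": "free text"}]}},
        {"commentType": "FUNCTION", "texts": [{"value": "x"}]},
    ]
}
print(summarize_diseases(toy_entry))
```

Calling `summarize_diseases(data)` gives a string that can be placed in `variant_prompt` instead of `disease_summary`.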

2.3.2 Example 2: Protein Data Bank (PDB) integration#

Now let’s look at a scenario where a biologist wants to understand the 3D structure of p53 bound to DNA — what the structure looks like, what experimental method was used, and what biological functions the structure encodes.

Step 1: Search PDB for p53 Structures#

import requests

# Search PDB for human p53 structures
search_url = "https://search.rcsb.org/rcsbsearch/v2/query"

query = {
    "query": {
        "type": "terminal",
        "service": "full_text",
        "parameters": {"value": "p53 DNA binding human"}
    },
    "return_type": "entry",
    "request_options": {
        "paginate": {"start": 0, "rows": 5}
    }
}

response = requests.post(search_url, json=query, timeout=30)
response.raise_for_status()  # stop early on HTTP errors
results = response.json()

pdb_ids = [r['identifier'] for r in results['result_set']]
print("Found PDB entries:", pdb_ids)
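
To try other search terms, the query dictionary can be produced by a small function, and the ID extraction can tolerate an empty result. This is a sketch that reuses the query structure shown above. The function names are hypothetical, and treating an empty response body as "no hits" is an assumption about how the search service reports a search without matches.

```python
def build_full_text_query(text, rows=5, start=0):
    """Build a full-text search payload with the same structure as the query above."""
    return {
        "query": {
            "type": "terminal",
            "service": "full_text",
            "parameters": {"value": text},
        },
        "return_type": "entry",
        "request_options": {"paginate": {"start": start, "rows": rows}},
    }

def extract_pdb_ids(results):
    """Return identifiers from a parsed search result, or [] if there were no hits."""
    if not results:
        return []
    return [hit["identifier"] for hit in results.get("result_set", [])]

query = build_full_text_query("p53 DNA binding human", rows=3)
print(query["request_options"])

# With the response from requests.post(search_url, json=query):
#   results = response.json() if response.content else None
#   pdb_ids = extract_pdb_ids(results)
```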

Step 2: Fetch Detailed Metadata for a Structure#

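# Entry used for the rest of this example. Check the title printed below:
# it describes what the deposited structure contains (for example, whether DNA is present).
pdb_id = "2OCJ"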

# Fetch entry-level metadata
entry_url = f"https://data.rcsb.org/rest/v1/core/entry/{pdb_id}"
entry_data = requests.get(entry_url).json()

# Fetch polymer entity info (the protein chains)
entity_url = f"https://data.rcsb.org/rest/v1/core/polymer_entity/{pdb_id}/1"
entity_data = requests.get(entity_url).json()

# Pull out key fields
title = entry_data['struct']['title']
method = entry_data['exptl'][0]['method']

# Resolution: try the refinement block first, then fall back to rcsb_entry_info
def get_resolution(entry_data):
    # Refine block (X-ray): ls_d_res_high is the high-resolution limit.
    # ls_d_res_low is the low-resolution limit and must not be reported as the resolution.
    refine = entry_data.get('refine') or [{}]
    val = refine[0].get('ls_d_res_high')
    if val is not None:
        return val
    # Fall back to rcsb_entry_info (works for cryo-EM and other methods too)
    combined = entry_data.get('rcsb_entry_info', {}).get('resolution_combined') or [None]
    return combined[0]

resolution = get_resolution(entry_data)

organism = entity_data['rcsb_entity_source_organism'][0]['scientific_name']
description = entity_data['rcsb_polymer_entity']['pdbx_description']

print(f"Title: {title}")
print(f"Method: {method}")
print(f"Resolution: {resolution} Å" if resolution else "Resolution: N/A")
print(f"Organism: {organism}")
print(f"Entity: {description}")

Step 3: Fetch the Protein Sequence from PDB#

# The polymer entity record fetched in Step 2 (entity_data) already contains
# the sequence of this entity, so no second request is needed
sequence = entity_data['entity_poly']['pdbx_seq_one_letter_code_can']
seq_length = len(sequence)

print(f"Sequence length in structure: {seq_length} aa")
print(f"First 60 aa: {sequence[:60]}")
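
Structures often cover only part of a protein, so it can be informative to see where the crystallized construct sits within the full-length UniProt sequence. The sketch below does this with an exact substring search. The helper name `locate_construct` is hypothetical, the sequences are toy strings, and an exact match will fail if the construct carries mutations or purification tags.

```python
def locate_construct(full_sequence, construct_sequence):
    """Return the 1-based (start, end) of the construct in the full sequence, or None."""
    start = full_sequence.find(construct_sequence)
    if start == -1:
        return None  # no exact match (e.g. mutations or tags in the construct)
    return start + 1, start + len(construct_sequence)

# Toy sequences (not real p53 data)
full_length = "MKTAYIAKQRQISFVKSHFSRQ"
construct = "AKQRQ"
print(locate_construct(full_length, construct))  # prints (7, 11)
```

Note that `sequence` is reassigned in this step, so keep the UniProt sequence from Example 1 under a separate name if you want to compare the two.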

# Visualize the p53-DNA Structure

import py3Dmol

# Create a viewer and load the structure directly from RCSB
view = py3Dmol.view(query=f'pdb:{pdb_id}', width=800, height=500)

# Color chains individually. Chain IDs and their contents differ between entries,
# so check the chain assignments on the RCSB entry page for the structure you load.
view.setStyle({'chain': 'A'}, {'cartoon': {'color': 'lightblue'}})
view.setStyle({'chain': 'B'}, {'cartoon': {'color': 'lightgreen'}})
view.setStyle({'chain': 'C'}, {'cartoon': {'color': 'lightyellow'}})
view.setStyle({'chain': 'D'}, {'cartoon': {'color': 'lightyellow'}})

# Show any DNA residues as sticks so they stand out (no effect if the entry has no DNA)
view.addStyle({'resn': ['DA', 'DT', 'DG', 'DC']}, {'stick': {'colorscheme': 'orangeCarbon'}})

# Zoom to fit
view.zoomTo()
view.show()

Step 4: Ask the LLM to Interpret the Structure#

import anthropic

client = anthropic.Anthropic(api_key=get_api_key(llm="anthropic"))

structure_prompt = f"""
I retrieved the following information about a protein crystal structure from the RCSB PDB:

PDB ID: {pdb_id}
Title: {title}
Experimental method: {method}
Resolution: {resolution} Å
Organism: {organism}
Protein: {description}
Sequence length in structure: {seq_length} amino acids

Please do the following:
1. Explain what this structure represents in plain language.
2. Explain how the experimental method ({method}) works and what resolution means.
   Is {resolution} Å considered good, and what level of detail does it provide?
3. Why is studying p53 in complex with DNA particularly important 
   for understanding its tumor suppressor function?
4. What are the limitations of studying a protein this way 
   (i.e., as a static crystal structure)?
"""

message = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": structure_prompt}]
)

print(message.content[0].text)

Step 5: Compare Multiple Structures#

A useful follow-up is to fetch several p53 structures and ask the LLM to reason about the differences:

# Fetch metadata for several p53 structures
structures_info = []
for pid in pdb_ids[:3]:  # use first 3 results from search
    e = requests.get(f"https://data.rcsb.org/rest/v1/core/entry/{pid}").json()
    try:
        structures_info.append({
            "pdb_id": pid,
            "title": e['struct']['title'],
            "method": e['exptl'][0]['method'],
            "resolution": get_resolution(e) or 'N/A'  # reuse the Step 2 helper (also covers non-X-ray entries)
        })
    except KeyError:
        continue

comparison_prompt = f"""
Here are several PDB structures related to p53:

{structures_info}

1. What differences between these structures might be biologically meaningful?
2. Why might researchers solve multiple structures of the same protein?
3. How might differences in resolution affect which structure a researcher chooses for drug design?
"""

response = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=1024,
    messages=[{"role": "user", "content": comparison_prompt}]
)

print(response.content[0].text)
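
The prompt above embeds `structures_info` as a raw Python list. Formatting it as one line per structure is easier to read when you inspect the prompt. The sketch below is one possible layout: the helper name `format_structures` is hypothetical and the example record is made up.

```python
def format_structures(structures_info):
    """Render one '- ID | method | resolution | title' line per structure."""
    lines = []
    for s in structures_info:
        res = s.get("resolution")
        # Numeric resolutions get a unit; anything else is reported as missing
        res_text = f"{res} Å" if isinstance(res, (int, float)) else "resolution N/A"
        lines.append(f"- {s['pdb_id']} | {s['method']} | {res_text} | {s['title']}")
    return "\n".join(lines)

# Made-up record with the same keys as structures_info
example = [
    {"pdb_id": "XXXX", "title": "Toy title", "method": "X-RAY DIFFRACTION", "resolution": 2.0}
]
print(format_structures(example))
```

In `comparison_prompt`, `{format_structures(structures_info)}` can then stand in for `{structures_info}`.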

Summary of PDB API Endpoints Used#

| Endpoint | What It Returns |
|---|---|
| `search.rcsb.org/rcsbsearch/v2/query` | Search for structures by keyword |
| `data.rcsb.org/rest/v1/core/entry/{id}` | Title, method, resolution, deposition date |
| `data.rcsb.org/rest/v1/core/polymer_entity/{id}/1` | Chain info, sequence, organism |
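
The examples above hard-code entity `1` in the polymer entity URL, but an entry with several molecules (for example a protein and DNA strands) has several entities. A small pair of URL builders makes the entity number explicit. The function names and the `RCSB_DATA_API` constant are illustrative; the URL patterns are the ones listed in the table.

```python
RCSB_DATA_API = "https://data.rcsb.org/rest/v1/core"

def entry_url(pdb_id):
    """URL for entry-level metadata (title, method, resolution)."""
    return f"{RCSB_DATA_API}/entry/{pdb_id.upper()}"

def polymer_entity_url(pdb_id, entity_id=1):
    """URL for one polymer entity (sequence, organism) of an entry."""
    return f"{RCSB_DATA_API}/polymer_entity/{pdb_id.upper()}/{entity_id}"

print(entry_url("2ocj"))
print(polymer_entity_url("2ocj", entity_id=2))
```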