Chapter 1: Protein Families and Classification

1. Introduction

This section explores how proteins are organized into families and superfamilies based on their evolutionary and structural relationships. Proteins are the primary targets of most therapeutic drugs, representing approximately 90% of current drug targets. This organization is fundamental to drug discovery because proteins within the same family often share structural features, functional mechanisms, and binding sites. Knowing which family a protein belongs to allows researchers to predict drug-target interactions, identify new therapeutic targets, and repurpose existing drugs across related protein families.


2. Key Concepts and Definitions

  • Protein Family: A group of proteins that share significant sequence similarity, structural features, and often similar functions, suggesting they evolved from a common ancestral protein.

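  • Homology: The relationship between proteins that share a common evolutionary origin. Homology is inferred rather than measured: as a rule of thumb, proteins sharing >30% sequence identity over a long alignment are generally considered homologous, although true homologs can share far less.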

  • Domain: A distinct functional and structural unit within a protein that can evolve, function, and exist independently. Domains are the fundamental building blocks of proteins.

  • Superfamily: A broader grouping than a family, containing proteins with low sequence similarity but clear structural and functional relationships.

  • Ortholog: Proteins in different species that evolved from a common ancestral gene through speciation and typically retain the same function.

  • Paralog: Proteins within the same organism that arose through gene duplication and may have diverged in function.
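
The >30% identity rule of thumb in the Homology definition can be made concrete with a small calculation. The sketch below is a minimal illustration, assuming the two sequences have already been aligned (for example by an alignment tool) and that gaps are written as "-". The function name `percent_identity` and the toy sequences are hypothetical, and the choice of denominator (columns without gaps) is only one of several conventions in use.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # gap columns are excluded from the denominator
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Hypothetical toy alignment for illustration only, not real protein sequences
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAKQ-"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity over ungapped columns")
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so a reported identity is only comparable across studies when the denominator is stated.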


3. Main Content

3.1 Methods of Protein Classification

Proteins are organized into hierarchical classification systems based on evolutionary and structural relationships.

  1. Sequence-based classification groups proteins with significant sequence similarity (typically >30% identity) into families. Tools like BLAST identify relationships through sequence alignment. Multiple sequence alignments reveal conserved motifs and functional residues. The Pfam database contains over 19,000 protein families defined by conserved sequence domains.

  2. Structure-based classification organizes proteins by 3D structural features. The SCOP (Structural Classification of Proteins) database uses four levels: Class (overall secondary structure content), Fold (arrangement of secondary structures), Superfamily (probable evolutionary relationship), and Family (clear evolutionary relationship). CATH uses a similar hierarchy: Class, Architecture, Topology, Homology.

  3. Domain-based classification recognizes that most proteins contain multiple functional domains. Common domains appear in various combinations across different proteins. InterPro integrates information from multiple databases to classify proteins by domains and functional sites.

  4. Functional classification groups proteins by their biological or biochemical function regardless of sequence similarity. The Enzyme Commission (EC) number system classifies enzymes by reaction type. Gene Ontology (GO) provides standardized terms for molecular function, biological process, and cellular component.
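
As an illustration of functional classification, the first field of an EC number identifies the top-level reaction type. The sketch below is a minimal example under stated assumptions: the helper name `ec_top_level_class` is hypothetical, and the mapping includes class 7 (translocases), which IUBMB added in 2018 alongside the six classes listed in this chapter.

```python
# Top-level Enzyme Commission classes, keyed by the first field of the EC number
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def ec_top_level_class(ec_number: str) -> str:
    """Return the top-level EC class name for a string such as 'EC 3.4.21.4'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # drop an optional 'EC' prefix
    fields = text.split(".")
    if len(fields) != 4 or not fields[0].isdigit():
        raise ValueError(f"not a four-field EC number: {ec_number!r}")
    first = int(fields[0])
    if first not in EC_CLASSES:
        raise ValueError(f"unknown top-level EC class: {first}")
    return EC_CLASSES[first]


# EC 3.4.21.4 is trypsin; EC 2.7.11.11 is cAMP-dependent protein kinase
for ec in ("EC 3.4.21.4", "2.7.11.11"):
    print(ec, "->", ec_top_level_class(ec))
```

Only the first field is interpreted here. The remaining three fields narrow the reaction type further and would need a full EC table to decode.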

3.2 Classification of Proteins by Function

| Class | Function | Key Features | Examples | Example PDB Structures |
|---|---|---|---|---|
| Enzymes | Catalyze biochemical reactions; largest functional class | Active sites with catalytic residues; substrate specificity; cofactor requirements | Oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases; kinases, proteases, polymerases | 1ATP - cAMP-dependent protein kinase (kinase); 3PTB - Trypsin (protease); 1TAQ - Taq polymerase |
| Receptors | Bind signaling molecules and transmit signals across membranes or within cells | Ligand-binding domains; conformational changes upon binding; signal transduction capability | GPCRs, RTKs, nuclear receptors, ion channel receptors, cytokine receptors | 3SN6 - β2-adrenergic receptor (GPCR); 1IRK - Insulin receptor kinase (RTK); 1A52 - Estrogen receptor |
| Transport Proteins | Move molecules across membranes or through body fluids | Selective binding pockets; conformational changes for transport; gating mechanisms | ABC transporters, solute carriers, hemoglobin, albumin, ion channels | 1HHO - Hemoglobin; 1AO6 - Human serum albumin; 5A7X - Potassium channel |
| Structural Proteins | Provide mechanical support and shape to cells and tissues | High tensile strength; fibrous or filamentous structure; repetitive sequences | Collagen, keratin, actin, tubulin, elastin | 3HR2 - Collagen triple helix; 1ATN - Actin; 1TUB - Tubulin |
| Regulatory Proteins | Control gene expression and cellular processes | DNA-binding domains; specific sequence recognition; influence transcription | Transcription factors (p53, NF-κB), steroid hormone receptors, tumor suppressors, oncoproteins | 2OCJ - p53 tumor suppressor; 1NFI - NF-κB; 1NKP - c-Myc/Max transcription factor bound to DNA |
| Signaling Proteins | Mediate cell communication | Small size for diffusion; receptor binding domains; often secreted | Cytokines, growth factors (EGF, VEGF), hormones (insulin, growth hormone), G-proteins, kinases | 1EGF - Epidermal growth factor; 4INS - Insulin; 3V23 - VEGF |
| Immune Proteins | Defend against pathogens | Antigen recognition; variable regions; complement cascade activation | Antibodies (IgG, IgA, IgM), complement proteins, cytokines | 1IGT - IgG antibody; 1HFI - Immunoglobulin Fab fragment; 2XQW - Complement C3 |
| Storage Proteins | Store amino acids and ions | High capacity for binding; regulated release; often in vesicles or granules | Ferritin (iron), casein (amino acids), ovalbumin (nutrients) | 1FHA - Ferritin; 1OVA - Ovalbumin |
| Motor Proteins | Generate movement | ATP-binding sites; conformational changes drive motion; directional movement | Myosin (muscle contraction), kinesin, dynein (cargo transport) | 2MYS - Myosin; 3KIN - Kinesin motor domain; 4RH7 - Dynein motor domain |

⚠️ WARNING
You can run the code below to view the different proteins. Click the rocket icon at the top of the Jupyter Book page and select Live Code. Wait for the kernel to load before running the code.
!pip install py3Dmol
import py3Dmol

# Map each PDB ID to its functional class and protein name
proteins = {
    "3PTB": "Enzymes: Trypsin",
    "3RFM": "Receptors: A2A receptor",
    "1A3N": "Transport: Hemoglobin",
    "1BKV": "Structural: Collagen",
    "7EZJ": "Regulatory: p53",
    "4INS": "Signaling: Insulin",
    "1IGT": "Immune: IgG Antibody",
    "1FHA": "Storage: Ferritin",
    "2MYS": "Motor: Myosin"
}

# --- 1. Display the protein menu ---
print("--- Protein Selection Menu ---")
for pdb_id, description in proteins.items():
    print(f"  [{pdb_id}]: {description}")
print("------------------------------")

# --- 2. Read the user's PDB ID choice (normalized to upper case) ---
selected_id = input("Enter the 4-character PDB ID to render: ").upper().strip()

# --- 3. Validate choice and render the protein ---
if selected_id in proteins:
    print(f"\nLoading {selected_id} ({proteins[selected_id]})...")
    viewer = py3Dmol.view(query=f'pdb:{selected_id}', width=600, height=600)
    viewer.setStyle({'cartoon': {'color': 'spectrum'}})
    viewer.zoomTo()
    viewer.show()
else:
    print(f"\nError: '{selected_id}' is not a valid choice.")
    print("Please re-run the cell and select a PDB ID from the list.")

3.3 Major Protein Families in Drug Discovery

  1. The kinase superfamily contains over 500 human members that transfer phosphate groups from ATP to substrates. It is divided into serine/threonine kinases, tyrosine kinases, and dual-specificity kinases. Kinase inhibitors are major cancer therapeutics (imatinib, erlotinib).

  2. The G-protein-coupled receptor (GPCR) superfamily includes ~800 human members with seven transmembrane helices. It is divided into Class A (rhodopsin-like), Class B (secretin-like), Class C (glutamate-like), and others. GPCRs are the largest group of drug targets, targeted by ~34% of FDA-approved drugs.

  3. The nuclear receptor superfamily contains 48 human members that are ligand-activated transcription factors. It includes steroid receptors, thyroid receptors, and orphan receptors, and is targeted by drugs for inflammation, cancer, and metabolic diseases.

  4. Protease families comprise enzymes that cleave peptide bonds. They are classified by catalytic mechanism: serine proteases, cysteine proteases, aspartic proteases, and metalloproteases. They are targets for antivirals (HIV protease inhibitors) and cardiovascular drugs (ACE inhibitors).

  5. Ion channel superfamilies include voltage-gated channels (sodium, potassium, calcium), ligand-gated channels, and mechanically gated channels. They are critical drug targets for neurological, cardiovascular, and pain disorders.

  6. Transporter families move molecules across membranes. ABC transporters use ATP, while solute carriers use electrochemical gradients. Both are important for drug absorption and resistance mechanisms.

3.4 Why Is It Important to Classify Proteins into Families?

Proteins within families share conserved active sites and binding pockets, which enables structure-based drug design but also creates selectivity challenges. Closely related paralogs may bind the same drug, causing off-target effects. However, subtle differences between family members can be exploited for selective inhibitor design. Understanding family relationships enables homology modeling to predict structures, virtual screening across family members, drug repurposing to related targets, and prediction of resistance mutations.
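
The selectivity argument above can be sketched in code: if the binding-pocket residues of two family members are aligned position by position, the positions that differ are candidate handles for a selective inhibitor. This is a minimal illustration only. The function name `differing_positions` and both pocket sequences are hypothetical, and a real analysis would start from a structure-based alignment of actual family members.

```python
def differing_positions(pocket_a: str, pocket_b: str) -> list[tuple[int, str, str]]:
    """List (position, residue_a, residue_b) for aligned pocket positions that differ.

    Positions are 1-based indices into the aligned pocket strings.
    """
    if len(pocket_a) != len(pocket_b):
        raise ValueError("pocket residue strings must be aligned to equal length")
    return [
        (index, a, b)
        for index, (a, b) in enumerate(zip(pocket_a, pocket_b), start=1)
        if a != b
    ]


# Hypothetical aligned pocket residues for a target and one of its paralogs
pocket_target = "LGVAKEMTD"
pocket_paralog = "LGIAKEMFD"

differences = differing_positions(pocket_target, pocket_paralog)
conserved = len(pocket_target) - len(differences)

print(f"{conserved}/{len(pocket_target)} pocket positions conserved")
for position, target_res, paralog_res in differences:
    print(f"position {position}: target {target_res}, paralog {paralog_res}")
```

A mostly conserved pocket explains why one drug may bind both proteins, while the few differing positions mark where a compound might be modified to favor one family member over the other.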


4. Summary and Key Takeaways

In this section, we’ve explored how proteins are classified into families and superfamilies to understand drug action. By recognizing that shared ancestry leads to conserved structures, we can predict and explain why a drug may bind to multiple targets. This knowledge is a cornerstone of modern, structure-based drug design, moving beyond a single-target view to appreciate the complex network of interactions that define a drug’s true effect.

  • Proteins are functionally categorized as enzymes, receptors, transporters, structural proteins, regulatory proteins, signaling proteins, immune proteins, storage proteins, and motor proteins.

  • Structurally, they are classified as fibrous, globular, membrane, or intrinsically disordered proteins.

  • Major drug target families include kinases (>500 members), GPCRs (~800 members), nuclear receptors (48 members), proteases, ion channels, and transporters.

  • Understanding protein types and families is essential for identifying drug targets, predicting off-target effects, designing selective inhibitors, and repurposing drugs.