Chapter 1: Protein Families and Classification

1. Introduction

This section explores how proteins are organized into families and superfamilies based on their evolutionary and structural relationships. Proteins are the primary targets of most therapeutic drugs, representing approximately 90% of current drug targets. This organization is fundamental to drug discovery because proteins within the same family often share structural features, functional mechanisms, and binding sites. Knowing which family a protein belongs to allows researchers to predict drug-target interactions, identify new therapeutic targets, and repurpose existing drugs across related protein families.

2. Key Concepts and Definitions

Protein Family: A group of proteins that share significant sequence similarity, structural features, and often similar functions, suggesting they evolved from a common ancestral protein.
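Homology: The relationship between proteins that share a common evolutionary origin. Homology is inferred rather than measured: as a rule of thumb, proteins sharing >30% sequence identity over most of their length are generally considered homologous, although more distant homologs can fall below this threshold.
Domain: A distinct structural and functional unit within a protein that can often fold, function, and evolve independently of the rest of the chain. Domains are the fundamental building blocks of proteins.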
Superfamily: A broader grouping than a family, containing proteins with low sequence similarity but clear structural and functional relationships.
Ortholog: Proteins in different species that evolved from a common ancestral gene and typically retain the same function.
Paralog: Proteins within the same organism that arose through gene duplication and may have diverged in function.
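
The sequence-identity threshold mentioned under Homology can be made concrete with a small calculation. The sketch below assumes the two sequences have already been aligned (gaps written as `-`) and counts identity only over columns where neither sequence has a gap. Other denominators are also in use, for example the full alignment length or the shorter sequence length, so treat this as one convention among several. The function name `percent_identity` and the two fragments are made up for illustration and are not real protein sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, made-up aligned fragments (assumption: not real protein sequences).
seq_a = "MKT-AYIAKQR"
seq_b = "MKSLAYLAK-R"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity over ungapped columns")
```

For real proteins the alignment itself would come from a tool such as BLAST. A short fragment like this one says little about homology on its own, because high identity over a few residues can arise by chance.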

3. Main Content

3.1 Methods of Protein Classification

Proteins are organized into hierarchical classification systems based on evolutionary and structural relationships.

Sequence-based families group proteins with significant sequence similarity (typically >30% identity). Tools like BLAST identify relationships through sequence alignment. Multiple sequence alignments reveal conserved motifs and functional residues. The Pfam database contains over 19,000 protein families defined by conserved sequence domains.
Structure-based classification organizes proteins by 3D structural features. The SCOP (Structural Classification of Proteins) database uses four levels: Class (overall secondary structure content), Fold (arrangement of secondary structures), Superfamily (probable evolutionary relationship), and Family (clear evolutionary relationship). CATH uses a similar hierarchy: Class, Architecture, Topology, Homology.
Domain-based classification recognizes that most proteins contain multiple functional domains. Common domains appear in various combinations across different proteins. InterPro integrates information from multiple databases to classify proteins by domains and functional sites.
Functional classification groups proteins by their biological or biochemical function regardless of sequence similarity. The Enzyme Commission (EC) number system classifies enzymes by reaction type. Gene Ontology (GO) provides standardized terms for molecular function, biological process, and cellular component.
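
As a small illustration of functional classification, the sketch below splits an EC number into its four hierarchical levels and looks up the top-level class name. The helper names `ECNumber` and `parse_ec` are hypothetical and chosen for this example only. The class table lists the seven top-level EC classes; class 7 (translocases) is a later addition to the six classical classes.

```python
from typing import NamedTuple

# Top-level EC classes, keyed by the first digit of the EC number.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


class ECNumber(NamedTuple):
    """Hypothetical container for the four levels of an EC number."""
    enzyme_class: int
    subclass: str
    sub_subclass: str
    serial: str

    @property
    def class_name(self) -> str:
        return EC_CLASSES[self.enzyme_class]


def parse_ec(ec: str) -> ECNumber:
    """Parse strings such as 'EC 3.4.21.4' or '3.4.21.4'."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    top = int(parts[0])
    if top not in EC_CLASSES:
        raise ValueError(f"unknown EC class {top}")
    # Lower levels are kept as strings so partial numbers like '3.4.21.-' survive.
    return ECNumber(top, parts[1], parts[2], parts[3])


trypsin = parse_ec("EC 3.4.21.4")  # trypsin, a serine protease
print(trypsin.class_name, trypsin)
```

Because the hierarchy is encoded in the number itself, enzymes can be grouped by any prefix (for example, everything under `3.4` acts on peptide bonds) without consulting sequence or structure.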

3.2 Classification of Proteins by Function

Class	Function	Key Features	Examples	Example PDB Structures
Enzymes	Catalyze biochemical reactions; largest functional class	Active sites with catalytic residues; substrate specificity; cofactor requirements	Oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases; kinases, proteases, polymerases	1ATP - cAMP-dependent protein kinase (kinase); 3PTB - Trypsin (protease); 1TAQ - Taq polymerase
Receptors	Bind signaling molecules and transmit signals across membranes or within cells	Ligand-binding domains; conformational changes upon binding; signal transduction capability	GPCRs, RTKs, nuclear receptors, ion channel receptors, cytokine receptors	3SN6 - β2-adrenergic receptor (GPCR); 1IRK - Insulin receptor kinase (RTK); 1A52 - Estrogen receptor
Transport Proteins	Move molecules across membranes or through body fluids	Selective binding pockets; conformational changes for transport; gating mechanisms	ABC transporters, solute carriers, hemoglobin, albumin, ion channels	1HHO - Hemoglobin; 1AO6 - Human serum albumin; 5A7X - Potassium channel
Structural Proteins	Provide mechanical support and shape to cells and tissues	High tensile strength; fibrous or filamentous structure; repetitive sequences	Collagen, keratin, actin, tubulin, elastin	3HR2 - Collagen triple helix; 1ATN - Actin; 1TUB - Tubulin
Regulatory Proteins	Control gene expression and cellular processes	DNA-binding domains; specific sequence recognition; influence transcription	Transcription factors (p53, NF-κB), steroid hormone receptors, tumor suppressors, oncoproteins	2OCJ - p53 tumor suppressor; 1NFI - NF-κB; 2Q6H - c-Myc transcription factor
Signaling Proteins	Mediate cell communication	Small size for diffusion; receptor binding domains; often secreted	Cytokines, growth factors (EGF, VEGF), hormones (insulin, growth hormone), G-proteins, kinases	1EGF - Epidermal growth factor; 4INS - Insulin; 3V23 - VEGF
Immune Proteins	Defend against pathogens	Antigen recognition; variable regions; complement cascade activation	Antibodies (IgG, IgA, IgM), complement proteins, cytokines	1IGT - IgG antibody; 1HFI - Immunoglobulin Fab fragment; 2XQW - Complement C3
Storage Proteins	Store amino acids and ions	High capacity for binding; regulated release; often in vesicles or granules	Ferritin (iron), casein (amino acids), ovalbumin (nutrients)	1FHA - Ferritin; 1OVA - Ovalbumin
Motor Proteins	Generate movement	ATP-binding sites; conformational changes drive motion; directional movement	Myosin (muscle contraction), kinesin, dynein (cargo transport)	2MYS - Myosin; 3KIN - Kinesin motor domain; 4RH7 - Dynein motor domain

⚠️ WARNING

You can run the code below to view the different proteins. Click the rocket icon at the top of the Jupyter Book page and select Live Code, then wait for the kernel to load before running the code.

!pip install py3Dmol
import py3Dmol

# PDB IDs
proteins = {
    "3PTB": "Enzymes: Trypsin",
    "3RFM": "Receptors: A2A receptor",
    "1A3N": "Transport: Hemoglobin",
    "1BKV": "Structural: Collagen",
    "7EZJ": "Regulatory: p53",
    "4INS": "Signaling: Insulin",
    "1IGT": "Immune: IgG Antibody",
    "1FHA": "Storage: Ferritin",
    "2MYS": "Motor: Myosin"
}

# --- 1. Display the protein menu ---
print("--- Protein Selection Menu ---")
for pdb_id, description in proteins.items():
    print(f"  [{pdb_id}]: {description}")
print("------------------------------")

# --- 2. Read the user's choice (normalized to upper case, whitespace removed) ---
selected_id = input("Enter the 4-character PDB ID to render: ").strip().upper()

# --- 3. Validate choice and render the protein ---
if selected_id in proteins:
    print(f"\nLoading {selected_id} ({proteins[selected_id]})...")
    viewer = py3Dmol.view(query=f'pdb:{selected_id}', width=600, height=600)
    viewer.setStyle({'cartoon': {'color': 'spectrum'}})
    viewer.zoomTo()
    viewer.show()
else:
    print(f"\nError: '{selected_id}' is not a valid choice.")
    print("Please re-run the cell and select a PDB ID from the list.")

Major Protein Families in Drug Discovery

The kinase superfamily contains over 500 human members that transfer phosphate groups from ATP to substrates. It is divided into serine/threonine kinases, tyrosine kinases, and dual-specificity kinases. Kinase inhibitors such as imatinib and erlotinib are major cancer therapeutics.
The G protein-coupled receptor (GPCR) superfamily includes ~800 human members with seven transmembrane helices. It is divided into Class A (rhodopsin-like), Class B (secretin-like), Class C (glutamate-like), and others. GPCRs are the largest group of drug targets, targeted by ~34% of FDA-approved drugs.
The nuclear receptor superfamily contains 48 human members that are ligand-activated transcription factors. It includes steroid receptors, thyroid receptors, and orphan receptors, and is targeted by drugs for inflammation, cancer, and metabolic diseases.
Protease families comprise enzymes that cleave peptide bonds. They are classified by catalytic mechanism: serine proteases, cysteine proteases, aspartic proteases, and metalloproteases. Proteases are targets for antivirals (HIV protease inhibitors) and cardiovascular drugs (ACE inhibitors).
Ion channel superfamilies include voltage-gated channels (sodium, potassium, calcium), ligand-gated channels, and mechanically gated channels. They are critical drug targets for neurological, cardiovascular, and pain disorders.
Transporter families move molecules across membranes. ABC transporters use ATP, while solute carriers use electrochemical gradients. Both are important for drug absorption and resistance mechanisms.

Why is it important to classify proteins into families?

Proteins within families share conserved active sites and binding pockets, which enables structure-based drug design but also creates selectivity challenges. Closely related paralogs may bind the same drug, causing off-target effects. However, subtle differences between family members can be exploited for selective inhibitor design. Understanding family relationships enables homology modeling to predict structures, virtual screening across family members, drug repurposing to related targets, and prediction of resistance mutations.
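
The selectivity argument above can be sketched in code: given the binding-pocket residues of two paralogs at equivalent (aligned) positions, list the positions where they differ, since those are candidate handles for a selective inhibitor. Everything in this example is hypothetical, including the position labels `p1` to `p6`, the residues, and the function names; real pocket definitions would come from a structural alignment.

```python
from typing import Dict, List, Tuple


def pocket_differences(
    pocket_a: Dict[str, str], pocket_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (position, residue_a, residue_b) where the two pockets differ."""
    shared = sorted(set(pocket_a) & set(pocket_b))
    return [
        (pos, pocket_a[pos], pocket_b[pos])
        for pos in shared
        if pocket_a[pos] != pocket_b[pos]
    ]


def pocket_identity(pocket_a: Dict[str, str], pocket_b: Dict[str, str]) -> float:
    """Percent of shared pocket positions carrying the same residue."""
    shared = set(pocket_a) & set(pocket_b)
    if not shared:
        return 0.0
    same = sum(1 for pos in shared if pocket_a[pos] == pocket_b[pos])
    return 100.0 * same / len(shared)


# Hypothetical aligned pocket residues for two paralogs (made-up data).
paralog_a = {"p1": "L", "p2": "V", "p3": "T", "p4": "E", "p5": "M", "p6": "D"}
paralog_b = {"p1": "L", "p2": "V", "p3": "M", "p4": "E", "p5": "M", "p6": "D"}

differences = pocket_differences(paralog_a, paralog_b)
conservation = pocket_identity(paralog_a, paralog_b)
print(f"pocket identity: {conservation:.0f}%")
for pos, res_a, res_b in differences:
    print(f"{pos}: {res_a} in paralog A vs {res_b} in paralog B")
```

In this toy case the pockets are largely identical, which is the off-target risk described above, while the single differing position is the kind of subtle difference a selective inhibitor might exploit.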

4. Summary and Key Takeaways

In this section, we’ve explored how proteins are classified into families and superfamilies to understand drug action. By recognizing that shared ancestry leads to conserved structures, we can predict and explain why a drug may bind to multiple targets. This knowledge is a cornerstone of modern, structure-based drug design, moving beyond a single-target view to appreciate the complex network of interactions that define a drug’s true effect.

Proteins are functionally categorized as enzymes, receptors, transporters, structural proteins, regulatory proteins, signaling proteins, immune proteins, storage proteins, and motor proteins.
Structurally, they are classified as fibrous, globular, membrane, or intrinsically disordered proteins.
Major drug target families include kinases (>500 members), GPCRs (~800 members), nuclear receptors (48 members), proteases, ion channels, and transporters.
Understanding protein types and families is essential for identifying drug targets, predicting off-target effects, designing selective inhibitors, and repurposing drugs.