Chapter 1: Protein Families and Classification
1. Introduction
This section explores how proteins are organized into families and superfamilies based on their evolutionary and structural relationships. Proteins are the primary targets of most therapeutic drugs, representing approximately 90% of current drug targets. Classification is fundamental to drug discovery because proteins within the same family often share structural features, functional mechanisms, and binding sites. This knowledge allows researchers to predict drug-target interactions, identify new therapeutic targets, and repurpose existing drugs across related protein families.
2. Key Concepts and Definitions
Protein Family: A group of proteins that share significant sequence similarity, structural features, and often similar functions, suggesting they evolved from a common ancestral protein.
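Homology: The relationship between proteins that share a common evolutionary origin. Homology is inferred rather than measured directly: sequence identity above roughly 30% over a full-length alignment is generally taken as strong evidence of it, although some homologs share far less.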
Domain: A distinct functional and structural unit within a protein that can evolve, function, and exist independently. Domains are the fundamental building blocks of proteins.
Superfamily: A broader grouping than a family, containing proteins with low sequence similarity but clear structural and functional relationships.
Ortholog: Proteins in different species that evolved from a common ancestral gene and typically retain the same function.
Paralog: Proteins within the same organism that arose through gene duplication and may have diverged in function.
3. Main Content
3.1 Methods of Protein Classification
Proteins are organized into hierarchical classification systems based on evolutionary and structural relationships.
Sequence-based families group proteins with significant sequence similarity (typically >30% identity). Tools like BLAST identify relationships through sequence alignment. Multiple sequence alignments reveal conserved motifs and functional residues. The Pfam database contains over 19,000 protein families defined by conserved sequence domains.
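The identity threshold mentioned above can be made concrete with a minimal sketch. It assumes the two sequences have already been aligned by a tool such as BLAST and that identity is counted only over columns where neither sequence has a gap (other conventions exist). The function name `percent_identity` and the two short fragments are made up for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("Aligned sequences must have the same length")
    compared = 0
    matches = 0
    for res_a, res_b in zip(aligned_a.upper(), aligned_b.upper()):
        if res_a == "-" or res_b == "-":
            continue  # skip gap columns
        compared += 1
        if res_a == res_b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy, invented aligned fragments ("-" marks an alignment gap)
seq_a = "MKT-AYIAKQR"
seq_b = "MKSLAYLA-QR"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity over ungapped columns")
```

Short fragments like these can reach high identity by chance, which is why thresholds such as 30% are only meaningful over long alignments.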
Structure-based classification organizes proteins by 3D structural features. The SCOP (Structural Classification of Proteins) database uses four levels: Class (overall secondary structure content), Fold (arrangement of secondary structures), Superfamily (probable evolutionary relationship), and Family (clear evolutionary relationship). CATH uses a similar hierarchy: Class, Architecture, Topology, Homology.
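One way to picture the four SCOP levels is as a dotted identifier with one field per level, in the style of SCOP's concise classification strings (class.fold.superfamily.family). The sketch below finds the deepest level two such identifiers share. The helper name `deepest_shared_level` is hypothetical, and the identifiers in the example are used only to illustrate the format, not as verified database entries.

```python
SCOP_LEVELS = ["Class", "Fold", "Superfamily", "Family"]


def deepest_shared_level(sccs_a: str, sccs_b: str):
    """Return the deepest SCOP level at which two identifiers agree, or None."""
    shared = None
    for level, field_a, field_b in zip(SCOP_LEVELS, sccs_a.split("."), sccs_b.split(".")):
        if field_a != field_b:
            break  # deeper levels cannot be shared once a higher level differs
        shared = level
    return shared


# Illustrative identifiers only (format: class.fold.superfamily.family)
print(deepest_shared_level("a.1.1.2", "a.1.1.3"))  # same superfamily, different family
print(deepest_shared_level("a.1.1.2", "a.4.5.2"))  # same class only
print(deepest_shared_level("a.1.1.2", "b.1.1.1"))  # nothing shared
```

Two proteins that agree down to the Superfamily level but not the Family level are the typical case of probable, but distant, evolutionary relatedness.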
Domain-based classification recognizes that most proteins contain multiple functional domains. Common domains appear in various combinations across different proteins. InterPro integrates information from multiple databases to classify proteins by domains and functional sites.
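The idea that common domains recur in different combinations can be sketched by treating each protein as a list of domains. The architectures below are deliberately simplified (for example, Abl contains further regions that are omitted here), and the helper names `shared_domains` and `domain_jaccard` are assumptions made for this illustration rather than part of any database API.

```python
# Simplified domain architectures, N- to C-terminus (illustrative, not complete)
architectures = {
    "Src": ["SH3", "SH2", "Kinase"],
    "Abl": ["SH3", "SH2", "Kinase"],
    "Grb2": ["SH3", "SH2", "SH3"],
}


def shared_domains(protein_a: str, protein_b: str) -> set:
    """Domain types present in both proteins."""
    return set(architectures[protein_a]) & set(architectures[protein_b])


def domain_jaccard(protein_a: str, protein_b: str) -> float:
    """Jaccard similarity of the two proteins' domain-type sets."""
    set_a = set(architectures[protein_a])
    set_b = set(architectures[protein_b])
    return len(set_a & set_b) / len(set_a | set_b)


print(sorted(shared_domains("Src", "Grb2")))
print(round(domain_jaccard("Src", "Grb2"), 2))
print(domain_jaccard("Src", "Abl"))
```

Comparing sets ignores domain order and copy number; a fuller treatment would compare the ordered architectures themselves.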
Functional classification groups proteins by their biological or biochemical function regardless of sequence similarity. The Enzyme Commission (EC) number system classifies enzymes by reaction type. Gene Ontology (GO) provides standardized terms for molecular function, biological process, and cellular component.
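As a small illustration of the EC system, the first field of an EC number identifies the top-level reaction class. The sketch below maps that field to a class name; the function name `ec_top_level_class` is hypothetical, and class 7 (translocases) is included on the assumption that the current EC scheme, which added it in 2018, is the one in use.

```python
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",  # added to the EC system in 2018
}


def ec_top_level_class(ec_number: str) -> str:
    """Return the top-level enzyme class for an EC number such as 'EC 3.4.21.4'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()  # drop an optional "EC" prefix
    first_field = text.split(".")[0]
    if not first_field.isdigit() or int(first_field) not in EC_CLASSES:
        raise ValueError(f"Unrecognized EC number: {ec_number!r}")
    return EC_CLASSES[int(first_field)]


print(ec_top_level_class("EC 3.4.21.4"))  # trypsin
print(ec_top_level_class("2.7.11.1"))     # a serine/threonine protein kinase
```

Because the classification is by reaction type, two enzymes can share an EC class without sharing any sequence or structural similarity.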
3.2 Classification of Proteins by Function
| Class | Function | Key Features | Examples | PDB Structure |
|---|---|---|---|---|
| Enzymes | Catalyze biochemical reactions; largest functional class | Active sites with catalytic residues; substrate specificity; cofactor requirements | Oxidoreductases, transferases, hydrolases, lyases, isomerases, ligases; kinases, proteases, polymerases | 1ATP - cAMP-dependent protein kinase (catalytic subunit) |
| Receptors | Bind signaling molecules and transmit signals across membranes or within cells | Ligand-binding domains; conformational changes upon binding; signal transduction capability | GPCRs, RTKs, nuclear receptors, ion channel receptors, cytokine receptors | 3SN6 - β2-adrenergic receptor (GPCR) |
| Transport Proteins | Move molecules across membranes or through body fluids | Selective binding pockets; conformational changes for transport; gating mechanisms | ABC transporters, solute carriers, hemoglobin, albumin, ion channels | 1HHO - Hemoglobin |
| Structural Proteins | Provide mechanical support and shape to cells and tissues | High tensile strength; fibrous or filamentous structure; repetitive sequences | Collagen, keratin, actin, tubulin, elastin | |
| Regulatory Proteins | Control gene expression and cellular processes | DNA-binding domains; specific sequence recognition; influence transcription | Transcription factors (p53, NF-κB), steroid hormone receptors, tumor suppressors, oncoproteins | 2OCJ - p53 tumor suppressor |
| Signaling Proteins | Mediate cell communication | Small size for diffusion; receptor binding domains; often secreted | Cytokines, growth factors (EGF, VEGF), hormones (insulin, growth hormone), G-proteins, kinases | |
| Immune Proteins | Defend against pathogens | Antigen recognition; variable regions; complement cascade activation | Antibodies (IgG, IgA, IgM), complement proteins, cytokines | 1IGT - IgG antibody |
| Storage Proteins | Store amino acids and ions | High capacity for binding; regulated release; often in vesicles or granules | Ferritin (iron), casein (amino acids), ovalbumin (nutrients) | |
| Motor Proteins | Generate movement | ATP-binding sites; conformational changes drive motion; directional movement | Myosin (muscle contraction), kinesin, dynein (cargo transport) | 2MYS - Myosin |
# Install py3Dmol (the leading "!" runs a shell command in a Jupyter/Colab cell)
!pip install py3Dmol

import py3Dmol

# Map each PDB ID to its functional class and example protein
proteins = {
    "3PTB": "Enzymes: Trypsin",
    "3RFM": "Receptors: A2A receptor",
    "1A3N": "Transport: Hemoglobin",
    "1BKV": "Structural: Collagen",
    "7EZJ": "Regulatory: p53",
    "4INS": "Signaling: Insulin",
    "1IGT": "Immune: IgG Antibody",
    "1FHA": "Storage: Ferritin",
    "2MYS": "Motor: Myosin"
}

# --- 1. Display the protein menu ---
print("--- Protein Selection Menu ---")
for pdb_id, description in proteins.items():
    print(f"  [{pdb_id}]: {description}")
print("------------------------------")

# --- 2. Read the user's selection ---
selected_id = input("Enter the 4-character PDB ID to render: ").upper().strip()

# --- 3. Validate the choice and render the protein ---
if selected_id in proteins:
    print(f"\nLoading {selected_id} ({proteins[selected_id]})...")
    # Fetch the structure from the PDB and draw it as a rainbow-colored cartoon
    viewer = py3Dmol.view(query=f'pdb:{selected_id}', width=600, height=600)
    viewer.setStyle({'cartoon': {'color': 'spectrum'}})
    viewer.zoomTo()
    viewer.show()
else:
    print(f"\nError: '{selected_id}' is not a valid choice.")
    print("Please re-run the cell and select a PDB ID from the list.")
Major Protein Families in Drug Discovery
The kinase superfamily contains over 500 human members that transfer phosphate groups from ATP to substrates. It is divided into serine/threonine kinases, tyrosine kinases, and dual-specificity kinases. Kinase inhibitors such as imatinib and erlotinib are major cancer therapeutics.
The G-protein-coupled receptor (GPCR) superfamily includes ~800 human members, each with seven transmembrane helices. It is divided into Class A (rhodopsin-like), Class B (secretin-like), Class C (glutamate-like), and others. GPCRs are the largest group of drug targets, targeted by ~34% of FDA-approved drugs.
The nuclear receptor superfamily contains 48 human members that are ligand-activated transcription factors. It includes steroid receptors, thyroid receptors, and orphan receptors. Nuclear receptors are targeted by drugs for inflammation, cancer, and metabolic diseases.
Proteases are enzymes that cleave peptide bonds. They are classified by catalytic mechanism: serine proteases, cysteine proteases, aspartic proteases, and metalloproteases. They are targets for antivirals (HIV protease inhibitors) and cardiovascular drugs (ACE inhibitors).
Ion channel superfamilies include voltage-gated channels (sodium, potassium, calcium), ligand-gated channels, and mechanically gated channels. They are critical drug targets for neurological, cardiovascular, and pain disorders.
Transporter families move molecules across membranes. ABC transporters use ATP, while solute carriers use electrochemical gradients. Both are important for drug absorption and resistance mechanisms.
Why is it important to classify proteins into families?
Proteins within families share conserved active sites and binding pockets, which enables structure-based drug design but also creates selectivity challenges. Closely related paralogs may bind the same drug, causing off-target effects. However, subtle differences between family members can be exploited for selective inhibitor design. Understanding family relationships enables homology modeling to predict structures, virtual screening across family members, drug repurposing to related targets, and prediction of resistance mutations.
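The selectivity point can be illustrated with a minimal sketch that compares the binding-pocket residues of two family members and reports where they differ. It assumes the pocket residues have already been extracted and aligned position by position. Both pocket strings are invented for illustration and do not correspond to real proteins, and the helper name `differing_positions` is hypothetical.

```python
def differing_positions(pocket_a: str, pocket_b: str):
    """List (position, residue_a, residue_b) for aligned pocket positions that differ."""
    if len(pocket_a) != len(pocket_b):
        raise ValueError("Pocket residue strings must be aligned to the same length")
    return [
        (index + 1, res_a, res_b)  # 1-based position within the pocket
        for index, (res_a, res_b) in enumerate(zip(pocket_a, pocket_b))
        if res_a != res_b
    ]


# Invented pocket residues for two hypothetical paralogs
pocket_target = "LVAKETMLG"
pocket_off_target = "LVAKEMMFG"

for position, res_target, res_off in differing_positions(pocket_target, pocket_off_target):
    print(f"Pocket position {position}: {res_target} (target) vs {res_off} (off-target)")
```

Positions that differ are candidate handles for selective inhibitor design, while fully conserved positions suggest where a drug is likely to bind both family members.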
4. Summary and Key Takeaways
In this section, we’ve explored how proteins are classified into families and superfamilies to understand drug action. By recognizing that shared ancestry leads to conserved structures, we can predict and explain why a drug may bind to multiple targets. This knowledge is a cornerstone of modern, structure-based drug design, moving beyond a single-target view to appreciate the complex network of interactions that define a drug’s true effect.
Proteins are functionally categorized as enzymes, receptors, transporters, structural proteins, regulatory proteins, signaling proteins, immune proteins, storage proteins, and motor proteins.
Structurally, they are classified as fibrous, globular, membrane, or intrinsically disordered proteins.
Major drug target families include kinases (>500 members), GPCRs (~800 members), nuclear receptors (48 members), proteases, ion channels, and transporters.
Understanding protein types and families is essential for identifying drug targets, predicting off-target effects, designing selective inhibitors, and repurposing drugs.