Figure 1
The Human Transcription Factor Repertoire
(A) Schematic of a prototypical TF.
(B) Number of TFs and motif status for each DBD family. Inset displays the distribution of the number of C2H2-ZF domains for classes of effector domains (KRAB, SCAN, or BTB domains); “Classic” indicates the related and highly conserved SP, KLF, EGR, GLI, GLIS, ZIC, and WT proteins.
(C) DBD configurations of human TFs. In the network diagram, edge width reflects the number of TFs with each combination of DBDs.
(D) Number of auxiliary (non-DNA-binding) domains (from InterPro) present in TFs, broken down by DBD family.
Transcription factors (TFs) recognize specific DNA sequences to control chromatin and transcription, forming a complex system that guides expression of the genome. Despite keen interest in understanding how TFs control gene expression, it remains challenging to determine how the precise genomic binding sites of TFs are specified and how TF binding ultimately relates to regulation of transcription. This review considers how TFs are identified and functionally characterized, principally through the lens of a catalog of over 1,600 likely human TFs and binding motifs for two-thirds of them. Major classes of human TFs differ markedly in their evolutionary trajectories and expression patterns, underscoring distinct functions. TFs likewise underlie many different aspects of human physiology, disease, and variation, highlighting the importance of continued effort to understand TF-mediated gene regulation.
Transcription factors (TFs) directly interpret the genome, performing the first step in decoding the DNA sequence. Many function as “master regulators” and “selector genes”, exerting control over processes that specify cell types and developmental patterning (Lee and Young, 2013) and controlling specific pathways such as immune responses (Singh et al., 2014). In the laboratory, TFs can drive cell differentiation (Fong and Tapscott, 2013) and even de-differentiation and trans-differentiation (Takahashi and Yamanaka, 2016). Mutations in TFs and TF-binding sites underlie many human diseases. Their protein sequences, regulatory regions, and physiological roles are often deeply conserved among metazoans (Bejerano et al., 2004, Carroll, 2008), suggesting that global gene regulatory “networks” may be similarly conserved. And yet, there is high turnover in individual regulatory sequences (Weirauch and Hughes, 2010), and over longer timescales, TFs duplicate and diverge. The same TF can regulate different genes in different cell types (e.g., ESR1 in breast and endometrial cell lines [Gertz et al., 2012]), indicating that regulatory networks are dynamic even within the same organism. Determining how TFs are assembled in different ways to recognize binding sites and control transcription is daunting yet paramount to understanding their physiological roles, decoding specific functional properties of genomes, and mapping how highly specific expression programs are orchestrated in complex organisms.
This review considers our current understanding of TFs and their global functions to provide context for thinking about how TFs work individually and as an ensemble. We also provide a catalog of the human TF complement and a comprehensive assessment of whether a DNA-binding motif is known for each TF. We use this catalog to survey human TF function, expression, and evolution, highlighting the roles played by TFs in human disease, including the effect of variation within TF proteins and TF-binding sites. A comprehensive review of ∼1,600 proteins is impossible; instead, we attempt to exemplify emerging trends and techniques, as well as shortcomings in existing data.
Historically, the term transcription factor has been applied to describe any protein involved in transcription and/or capable of altering gene-expression levels. In the current vernacular, however, the term is reserved for proteins capable of (1) binding DNA in a sequence-specific manner and (2) regulating transcription (Figure 1A) (Fulton et al., 2009, Vaquerizas et al., 2009). TFs can have 1,000-fold or greater preference for specific binding sequences relative to other sequences (Damante et al., 1994, Geertz et al., 2012). Because TFs can act by occluding the DNA-binding site of other proteins (e.g., the classic lambda, lac, and trp repressors [Ptashne, 2011]), the ability to bind specific DNA sequences is, by itself, often taken as an indicator of ability to regulate transcription.
These proteins cannot be understood functionally without accompanying detailed knowledge of the DNA sequences they bind. TF DNA-binding specificities are frequently summarized as “motifs”—models representing the set of related, short sequences preferred by a given TF, which can be used to scan longer sequences (e.g., promoters) to identify potential binding sites. Determining a DNA-binding motif is often the first step toward detailed examination of the function of a TF because identification of potential binding sites provides a gateway to further analyses. Our ability to generate both motifs and genomic binding sites has improved dramatically over the last decade, leading to an unprecedented wealth of data on TF-DNA interactions. To develop the current TF catalog, we have drawn heavily upon motif collections such as TRANSFAC (Matys et al., 2006), JASPAR (Mathelier et al., 2016), HT-SELEX (Jolma et al., 2013, Jolma et al., 2015, Yin et al., 2017), UniPROBE (Hume et al., 2015), and CisBP (Weirauch et al., 2014), along with previous catalogs of human TFs (Fulton et al., 2009, Vaquerizas et al., 2009, Wingender et al., 2015).
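As an illustration of how a motif can be used to scan a longer sequence, the following is a minimal sketch of position weight matrix (PWM) scoring. The count matrix, the pseudocount, the uniform background, and the score threshold are all hypothetical values chosen for the example; they are not taken from any of the motif collections cited above. The sketch scans the forward strand only, whereas practical scanners generally also score the reverse complement.

```python
import math

# Hypothetical base counts at each position of a 4-bp motif (illustrative only;
# the consensus of this toy matrix is AGCT).
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [1, 0, 9, 1],
    "G": [0, 10, 1, 0],
    "T": [1, 0, 0, 8],
}


def build_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into log2-odds scores against a uniform background."""
    width = len(counts["A"])
    pwm = []
    for i in range(width):
        total = sum(counts[base][i] + pseudocount for base in "ACGT")
        column = {
            base: math.log2(((counts[base][i] + pseudocount) / total) / background)
            for base in "ACGT"
        }
        pwm.append(column)
    return pwm


def score_window(window, pwm):
    """Sum the per-position log-odds scores for one window of motif width."""
    return sum(column[base] for base, column in zip(window, pwm))


def scan_sequence(sequence, pwm, threshold):
    """Return (position, score) for every forward-strand window scoring >= threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(sequence[start:start + width], pwm)
        if score >= threshold:
            hits.append((start, score))
    return hits


pwm = build_pwm(COUNTS)
hits = scan_sequence("TTAGCTAA", pwm, threshold=4.0)
```

Here `hits` holds a single window, the consensus match starting at position 2 (zero-based). Lowering the threshold admits weaker matches, which is the usual trade-off between sensitivity and the number of spurious candidate sites.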
There is typically only a partial overlap between experimentally determined binding sites in the genome and sequences matching the motif; moreover, even experimentally determined binding sites are relatively poor predictors of genes that the TFs actually regulate (Cusanovich et al., 2014). At the same time, motif matches are often among the most enriched sequences in a ChIP-seq (chromatin immunoprecipitation sequencing) dataset, indicating that intrinsic DNA-binding specificity is important for TF binding in vivo. In retrospect, this outcome should have been expected: most TF-binding sites are small (usually 6–12 bases) and flexible, so a typical human gene (>20 kb) will contain multiple potential binding sites for most TFs (Wunderlich and Mirny, 2009). Well-established concepts such as cooperativity and synergy between TFs provide a ready solution to this deficit in specificity—most human TFs have to work together to get anything done—but the details of their interactions and relationships are generally lacking. The biochemical effects of TFs subsequent to binding DNA are also largely unmapped and known to be context dependent. As a result, decoding how gene regulation relates to TF-binding motifs and gene sequences remains a major practical challenge; the resulting frustration has been embodied in the term “futility theorem” (Wasserman and Sandelin, 2004).
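The argument that short binding sites recur by chance can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a deliberately simplified null model that is not part of the original analysis: bases are independent and equally frequent, a site must match exactly at every informative position, and both strands are counted as independent (which slightly overcounts palindromic sites). The function name and parameters are hypothetical.

```python
def expected_chance_matches(seq_len, informative_positions, both_strands=True):
    """Expected number of exact matches to a fixed site in random sequence.

    Assumes independent, equiprobable bases (each with probability 0.25), so a
    site with k informative positions matches a given window with probability
    4**-k.
    """
    windows = seq_len - informative_positions + 1
    if windows <= 0:
        return 0.0
    per_window = 0.25 ** informative_positions
    strands = 2 if both_strands else 1
    return strands * windows * per_window


# A 20 kb gene, for sites with 6, 8, and 12 fully specified positions.
for k in (6, 8, 12):
    print(k, expected_chance_matches(20_000, k))
```

Under these assumptions a fully specified 6-bp site is expected roughly 10 times by chance in a 20 kb gene, whereas a fully specified 12-bp site is expected far less than once. Because real sites are flexible, with several positions tolerating more than one base, their effective number of informative positions is lower than their length, which pushes the chance expectation upward.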