Range-Limited Heaps’ Law for Functional DNA Words in the Human Genome (2024)

Wentian Li^1,2, Yannis Almirantis^3, Astero Provata^4
1. Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA (current address)
2. The Robert S. Boas Center for Genomics and Human Genetics
The Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
3. Theoretical Biology and Computational Genomics Laboratory, Institute of Bioscience and Applications
National Center for Scientific Research “Demokritos”, 15341 Athens, Greece
4. Statistical Mechanics and Dynamical Systems Laboratory, Institute of Nanoscience and Nanotechnology
National Center for Scientific Research, “Demokritos”, 15341 Athens, Greece

(June 17, 2024)

Abstract

Heaps’ law (or Herdan’s law) is a linguistic law stating that the vocabulary/dictionary size (number of types) is a power-law function of the word count (number of tokens). Whether it holds in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps’ law in the human genome within a limited range. Our definition of words in a genomic or proteomic context differs from other definitions, such as over-represented k-mers, which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication, and the related Zipf’s law, are well known, their translation to Heaps’ law for DNA words is not automatic. Several other animal genomes are also shown here to exhibit a range-limited Heaps’ law under our definition of DNA words, though with different exponents. When tokens are randomly sampled and sample sizes approach the maximum level, a deviation from Heaps’ law is observed, but a quadratic regression in the log-log type-token plot fits the data perfectly. Investigating the type-token plot and its regression coefficients could provide an alternative narrative, from a linguistic perspective, of the reuse and redundancy of protein domains as well as of the creation of new protein domains.

Abbreviations: AIC: Akaike information criterion; CRCPD: coding regions that code for protein domains; GC content: percentage of guanine and cytosine bases.

1 Introduction

Molecular biology is full of terminology, and of quantitative and empirical laws, borrowed from linguistics (Key, 2000). Jargon used in molecular biology, such as transcription, translation, and code, is often borrowed from linguistic terms, not to mention the ubiquitous use of letters (nucleotides and amino acids) and text (DNA and protein sequences). The recent Nobel-prize-winning biotechnique of CRISPR was described as the equivalent of a text editor (Nelson et al., 2015). Not surprisingly, linguistic laws have also been invoked in genomics. For example, Zipf’s law (Zipf, 1935), which describes how the frequency of a word in a text drops off with its rank, was tested in DNA sequences with k-mers/k-tuples (e.g. k=6) considered as “words” (Mantegna et al., 1994). Putting aside the issue of the definition of DNA words, this study was criticized because the rank-frequency plot did not show as good a power law as in the case of human languages (Konopka and Martindale, 1995; Li, 2002). Another example of a linguistic law is Menzerath’s law (Menzerath, 1928), which describes the tendency to use smaller linguistic units when these units are more numerous. Menzerath’s law at the human exon level is the tendency for genes with more exons to have smaller exon sizes (Li, 2011a; Nikolaou, 2014), while other choices of units have also been investigated (Semple et al., 2022).

In this study, we focus on another linguistic law, Heaps’ law (Heaps, 1978) (or the Herdan-Heaps law (Egghe, 2007)), which states that the number of unique words (vocabulary/dictionary size) as a function of text length is a power law. Heaps’ law can be understood as a (number of) type-(number of) token (Herdan, 1960; Wetzel, 2009) relationship: a type is a unique word and a token is one usage of a word, i.e., the total number of types is the vocabulary size, and the total number of tokens is the vocabulary usage. Heaps’ law can also be understood as a law of diminishing returns (Petersen et al., 2012). If one searches longer and longer texts for the purpose of finding new words, Heaps’ law indicates that the chance of finding new words becomes smaller and smaller as more text is examined. This latter meaning of Heaps’ law has been implied in the number of genes or gene families discovered by sequencing more and more species’ genomes in a clade (Medini et al., 2020), or the number of single-nucleotide polymorphisms (SNPs) discovered by sequencing more samples (Ionita-Laza et al., 2009).

In order to apply a linguistic law to genomics, one first needs to find an entity corresponding to a word. A common practice is to consider (overlapping) k-mers with a fixed k value as “words”, because they are easy to compute, are used in alignment-free sequence comparison (Zielezinski et al., 2017; Li et al., 2019), and are important in many bioinformatics algorithms (Chikhi and Medvedev, 2013; Koren et al., 2017; Rahman et al., 2018). However, the problem with treating all fixed-length k-mers as words is that k-mers in general lack functional meaning.

Instead of all k-mers, attempts were made to select only the over-represented k-mers, based on the frequency expectation from (k-1)-mers, as words or motifs (Brendel et al., 1986; Phillips et al., 1987; Vilo, 2002; Apostolico et al., 2003; Gatherer, 2007). There are other motif detection methods based on statistical mechanics (Bussemaker et al., 2000; Moghaddasi et al., 2017). However, these are postulated units only. In fact, non-coding regions tend to have relatively more over-represented oligonucleotides than coding regions (Frontali and Pizzi, 1999). Many of the over-represented oligonucleotides in non-coding regions may be due to error accumulation (e.g. slippage errors) tolerated by purifying selection, but some others may have functional implications and thus deserve the name “words”, such as the mirror-symmetrical words found in introns (Brendel et al., 1986).

In the framework of semiotics (Hoopes, 2011), a word is part of the signifier-signified pair that represents a meaning. Without a biological meaning, a proposed DNA or protein word would be much less interesting as a basis for discussing linguistic laws. Although the in-frame non-overlapping 3-mers within exons (codons) do have a biological meaning (Mukhopadhyay et al., 2006), k-mers, even the over-represented ones, may or may not.

Recently, large language models (Brant et al., 2007), such as bidirectional encoder representations from transformers (BERT) (Devlin et al., 2018) or the generative pre-trained transformer (GPT) (Radford et al., 2019), have been applied to biomolecular sequences, in particular protein sequences (Rao et al., 2020; Madani et al., 2023; Nijkamp et al., 2023). One aspect of natural language processing is tokenization, which means delineating a character string by boundaries (Webster and Kit, 1992). For protein sequences, tokenization is a process for finding “words” (Ofer et al., 2021). The protein “words” found by this method are usually short, with an average size of only a few (e.g. four) amino acids (Ferruz et al., 2022; Dotan et al., 2024).

In this study, we use DNA segments that code for a protein domain (Wang et al., 2021) as our DNA “word tokens”. Treating protein domains as protein words has been proposed before (Searls, 2002; Gimona, 2006; Scaiewicz and Levitt, 2015; Yu et al., 2019; Buchan and Jones, 2020). There were studies that considered the internal structure of genes in DNA sequences as “grammar” or rules (Dong and Searls, 1994), but their emphasis was not on the definition of words. We prefer to define DNA words as substrings of gene sequences delimited by the boundaries of the protein domains within the proteins encoded by those genes. By doing so, our definition of DNA words has a biological meaning, i.e. coding for a protein domain.

To investigate whether Heaps’ law holds in the human genome, our task is to find all Coding Regions that Code for a Protein Domain (CRCPD). For any genomic region, we need to count the number of all CRCPDs (number of tokens) and the number of distinct CRCPDs among them (number of types). For one chromosome of the human genome, this pair of counts (#token, #type) gives only one point. To see a relationship between token and type, we repeat the same counting process for each of the 23 chromosomes and/or each of the chromosome arms, so that more points are produced. Using arms is reasonable when more points are desired, as chromosome arms may be seen as relatively independent entities, distinct from the whole chromosomes to which they belong.

It has been known for over twenty years that the histogram of protein domains or distinct parts follows an inverse power-law distribution in genomes (Qian et al., 2001; Luscombe et al., 2002; Koonin et al., 2002). Another well-known property is the important role that gene duplication plays in shaping the distribution of these protein domains as well as other genomic units (Müller et al., 2002; Gao and Miller, 2011; Li et al., 2016), echoing theoretical studies linking duplication and power-law behavior (Li, 1991; Ispolatov et al., 2005). A perfect inverse power-law histogram (distribution) will lead to a similar inverse power law with rank as the x-variable, i.e., Zipf’s law (Newman, 2005).

However, more and more studies show that when tokens are randomly chosen from a dictionary of finite size whose words follow a Zipf’s law distribution, the resulting type-token plot does not follow an exact Heaps’ law (Bernhardsson et al., 2009; Font-Clos and Corral, 2015). In particular, the type-token plot in log-log scale would show a concave downward shape (Font-Clos and Corral, 2015). Nevertheless, within a limited range, this concave curve might be approximately viewed as a straight line. Also, published proofs of relations between the power-law exponents of the two laws rely on certain assumptions, e.g. infinite dictionary size and the infinite token limit (Boytsov, 2017). All these observations imply that the power-law distribution of protein domains known for the last twenty years does not automatically lead to a power-law type-token plot for CRCPDs, which needs to be checked from the data directly.

In the next section, we describe the data and the methodology used. In section 3, we present evidence for a range-limited Heaps’ law in the human genome, while pointing out one chromosome (chromosome 19) as an outlier to Heaps’ law; we also include in our study another level of genomic entities, chromosome arms, to explore the extent of Heaps’ law. In section 4, we generate artificial units by randomly sampling DNA word tokens from the list of known DNA words, confirming the previously reported downward concavity in log-log scale, i.e. a systematic deviation from Heaps’ law. The same random sampling process is also carried out at the individual chromosome level, showing a distinction between the concavity and the finite dictionary size effect. In section 5, we examine the potential Heaps’ law in other animal genomes. Finally, various points are discussed in the Discussion section, including a comparison between the creation of new words in human languages and the evolution of protein domains in genomes.

2 Data and Methods

The protein domains used in the present study refer to Pfam (Mistry et al., 2021) (https://pfam.xfam.org/ or https://pfam-legacy.xfam.org/, which is hosted by InterPro (Paysan-Lafosse et al., 2022) after January 2023: https://www.ebi.ac.uk/interpro). The chromosomal locations of CRCPDs in human genes are available from the University of California at Santa Cruz (UCSC) Genome Browser track “Pfam domains in GENCODE genes”. A description of the methods used to derive this track can be found on the UCSC Genome Browser site (https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=ucscGenePfam).

To obtain this Pfam domains track for the human genome (GRCh38/hg38), from the top bar of the genome browser we select Tools $\rightarrow$ Table Browser, group = “Genes and Gene Predictions”, track = “Pfam in GENCODE”. The resulting table contains the chromosome number, chromosomal location (start, end, in bp), name of the Pfam domain, etc.

In this study, we also used chromosome arms as units. To determine the chromosome arm to which a region belongs, we downloaded the cytoband information from:
https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/cytoBand.txt.gz
to find the centromere locations in each chromosome. The counting of Pfam domains (both token and type) was carried out with R (https://www.r-project.org) scripts written by us.
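The arm assignment and counting step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' R scripts. It assumes cytoband rows of the form (chromosome, start, end, band name, stain) with band names beginning with "p" or "q", and it assigns a domain to an arm by its midpoint, which is an assumption of this sketch. The function names and the toy coordinates are hypothetical.

```python
from collections import defaultdict

def p_arm_end(cytoband_rows):
    """Return {chrom: end coordinate of the last p-arm band}.

    Each row is (chrom, start, end, band_name, stain); band names are
    assumed to start with 'p' or 'q'.
    """
    boundary = {}
    for chrom, _start, end, band, _stain in cytoband_rows:
        if band.startswith("p"):
            boundary[chrom] = max(boundary.get(chrom, 0), end)
    return boundary

def count_by_arm(domain_rows, boundary):
    """Return {(chrom, arm): (n_tokens, n_types)} for domain rows
    given as (chrom, start, end, pfam_name)."""
    names = defaultdict(list)
    for chrom, start, end, name in domain_rows:
        midpoint = (start + end) // 2  # assumption: assign by midpoint
        arm = "p" if midpoint < boundary.get(chrom, 0) else "q"
        names[(chrom, arm)].append(name)
    return {key: (len(v), len(set(v))) for key, v in names.items()}

# Toy rows with made-up coordinates (not real hg38 values).
cytobands = [
    ("chrA", 0, 1000, "p12", "gneg"),
    ("chrA", 1000, 1200, "p11.1", "acen"),
    ("chrA", 1200, 1400, "q11.1", "acen"),
    ("chrA", 1400, 3000, "q12", "gpos50"),
]
domains = [
    ("chrA", 100, 200, "zf-C2H2"),
    ("chrA", 300, 400, "zf-C2H2"),
    ("chrA", 500, 650, "WD40"),
    ("chrA", 1500, 1600, "zf-C2H2"),
    ("chrA", 2000, 2100, "Ank"),
]
boundary = p_arm_end(cytobands)
counts = count_by_arm(domains, boundary)
```

In this toy example the p-arm holds three tokens of two types and the q-arm two tokens of two types, so each arm contributes one (#token, #type) point.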

Heaps’ law is written as follows ($y$ = vocabulary size = #type, or the number of unique CRCPD domains in a chromosome; $x$ = vocabulary usage = #token, or the number of times any of these CRCPDs appears in the same chromosome):

$$y=Cx^{\alpha} \qquad (1)$$

which, after a logarithmic transformation on both sides ($y^{\prime}=\log(y)$, $x^{\prime}=\log(x)$, $C^{\prime}=\log(C)$), becomes a linear relationship: $y^{\prime}=C^{\prime}+\alpha x^{\prime}$. The exponent $\alpha$ is determined by a linear regression in R (the function lm). For model selection, the Akaike information criterion (AIC) (Akaike, 1974; Li and Nyholt, 2001) is calculated with the AICcmodavg package (cran.r-project.org/web/packages/AICcmodavg/), using the function aictab. For quadratic regression, we fit the equation $y^{\prime}=C^{\prime}+\alpha x^{\prime}+\beta x^{\prime 2}$.
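The comparison between the linear and quadratic log-log fits can be illustrated with a small self-contained sketch. It is written in Python rather than R and does not reproduce lm or aictab. The AIC is computed in the common Gaussian least-squares form $n\ln(\mathrm{RSS}/n)+2k$, which is an assumption of this sketch and may differ from the package output by a constant or a small-sample correction. The data are synthetic and the helper names are hypothetical.

```python
import math
import random

def polyfit_ls(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations.
    Returns coefficients ordered from the constant term upward."""
    m = degree + 1
    aug = [[sum(x ** (i + j) for x in xs) for j in range(m)]
           + [sum(y * x ** i for x, y in zip(xs, ys))]
           for i in range(m)]
    for col in range(m):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]

def aic_ls(xs, ys, coefs):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(xs)
    rss = sum((y - sum(c * x ** p for p, c in enumerate(coefs))) ** 2
              for x, y in zip(xs, ys))
    k = len(coefs) + 1  # regression coefficients plus residual variance
    return n * math.log(rss / n) + 2 * k

# Synthetic (log10 token, log10 type) pairs with a built-in curvature.
rng = random.Random(1)
log_x = [2.0 + 2.8 * i / 29 for i in range(30)]
log_y = [0.5 + 0.9 * x - 0.05 * x * x + rng.gauss(0, 0.005) for x in log_x]

linear = polyfit_ls(log_x, log_y, 1)      # [C', alpha]
quadratic = polyfit_ls(log_x, log_y, 2)   # [C', alpha, beta]
aic_linear = aic_ls(log_x, log_y, linear)
aic_quadratic = aic_ls(log_x, log_y, quadratic)
```

On data generated with a negative quadratic coefficient, the quadratic model obtains the lower AIC, while the linear fit still returns a slope between 0 and 1. That is the pattern a range-limited power law would produce.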

The following notation may give readers an idea of the type of data we have. Suppose there are N tokens (appearances of CRCPDs: coding regions for Pfam domains) and K unique ones, so that $N=\sum_{k=1}^{K}P_{k}$, where $P_{k}$ is the multiplicity of word type $k$. For any genomic region (chromosome, chromosome arm, or half-arm), we count the number of tokens $x_{i}$ ($i$ is the index of the sample point), of which $y_{i}$ are unique. When we have enough sample points $i=1,2,\cdots,I$, these ($x_{i}$, $y_{i}$) pairs can be log-transformed and used for a linear regression. We need not be constrained by genomic regions: we may instead sample $x_{i}$ tokens genome-wide from the pool of N tokens (sampling without replacement), of which $y_{i}$ are unique. This can be repeated I times with ever larger $x_{i}$ values. The resulting ($x_{i}$, $y_{i}$) pairs can be used to check Heaps’ law in the same way.
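The genome-wide sampling scheme in this notation can be sketched as below. This is an illustrative Python sketch, not the authors' R code. It assumes each sample of size $x_{i}$ is drawn independently from the full pool, and the multiplicities $P_{k}$ are invented toy values.

```python
import random

def type_token_curve(tokens, sample_sizes, seed=0):
    """For each sample size x, draw x tokens without replacement from
    the pool and count the distinct types y; return the (x, y) pairs."""
    rng = random.Random(seed)
    points = []
    for x in sample_sizes:
        sample = rng.sample(tokens, x)  # sampling without replacement
        points.append((x, len(set(sample))))
    return points

# Toy multiplicities P_k (hypothetical values, not taken from the genome).
multiplicity = {"zf-C2H2": 50, "I-set": 20, "WD40": 10,
                "Sushi": 5, "rare_A": 1, "rare_B": 1}
pool = [name for name, p in multiplicity.items() for _ in range(p)]
N = len(pool)          # total number of tokens: sum of P_k
K = len(multiplicity)  # dictionary size: number of types

points = type_token_curve(pool, [1, 10, 40, N])
```

Two end points are fixed by construction: a single token is always one type, and sampling the whole pool recovers all K types. The intermediate points depend on the random draw.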

3 Results for the human genome

3.1 Heaps’ law for Pfam CRCPD holds true

The number of Pfam domain appearances, as our example of CRCPD, in each autosome and the X chromosome of the human genome (version hg38) is obtained from the Pfam-in-GENCODE track of the UCSC Genome Browser. This produces 23 points for checking the relationship between vocabulary size (type) and vocabulary usage (token) and for a regression. In order to increase the sample size, we also consider data from the p- and q-arms of each chromosome, as partitioned by the centromere, for each metacentric (non-acrocentric) chromosome (41 points).

Fig.1(A) shows the number of types (number of unique/distinct Pfam domains) as a function of the number of tokens (Pfam domain counts) for the 23 human chromosomes in log-log scale. With the exception of one point, all points scatter around a straight line, indicating a power law and thus Heaps’ law. The outlier is chromosome 19, being lower in number of CRCPD types or higher in number of CRCPD tokens. The regression of log-type on log-token produces the slope (scaling exponent) $\alpha=0.658$ and $C=5.12$ in Eq.(1).

Fig.1(B) shows a similar type-token plot in log-log scale using chromosome arm data. The regression results are similar: $\alpha=0.683$ and $C=3.97$. Using both whole chromosome and chromosome arm data leads to Fig.1(C), with $\alpha=0.695$ and $C=3.7$. Figure 1(D) shows the same data as Fig.1(C) but in linear-linear scale. It is clear that a power-law function fits the data better than a linear regression. In fact, the variance explained by regression is much higher in log-log scale ($R^{2}=0.887,0.873,0.9$ for chromosome data, chromosome arm data, and combined) than in linear-linear scale ($R^{2}=0.697,0.667,0.738$).


Removing the outliers (one point in Fig.1(A), two points in Fig.1(B) for the two chromosome 19 arms, and three points in Fig.1(C)) results in even better $R^{2}$ values: 0.968, 0.955, and 0.967. The regression slope without outliers is slightly higher, $\alpha=0.754,0.772,0.770$, and $C$ drops to around 2.

3.2 The protein domains that are mostly responsible for the redundancy

Having more tokens than types means that some CRCPDs appear multiple times in a chromosome. Which CRCPDs appear more often than others? Table 1 shows the top three Pfam domains in each human chromosome, as well as in the whole genome, according to their frequency. At the whole-genome level, the zinc finger C2H2 type (zf-C2H2), a C2H2-type zinc finger (zf-C2H2_6), and the immunoglobulin I-set domain (I-set) are the most frequent domains, appearing 5493, 1609, and 900 times, respectively. Genome-wide there are 6839 Pfam domain names, whose IDs and annotations can be found at http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz or http://pfam-legacy.xfam.org/family/browse .

At the individual chromosome level, the zf-C2H2 domain (Miller et al., 1985) is among the top three domains for 19 out of 23 chromosomes (it is not in the top three for chromosomes 2, 11, 14, and 21). Other top-ranking domains at the individual chromosome level include the Olduvai domain (Chr1), immunoglobulin I-set domain (Chr2), cadherin domain (Chr4, 5), 7 transmembrane receptor (rhodopsin family) (Chr11), collagen triple helix repeat (Chr13), immunoglobulin V-set domain (Chr14, 22), and keratin high sulfur B2 protein (Chr21).

ch	most_freq CN	2nd CN	3rd CN	num_type	per(CN=1)	mean(CN)	max(CN)
1	Olduvai 215	zf-C2H2 214	Sushi 186	1759	64.7%	3.597	215
2	I-set 224	fn3 156	Ig_3 136	1355	63.8%	3.373	224
3	zf-C2H2 217	WD40 72	Ig_3 70	1178	65.3%	2.93	217
4	Cadherin 101	zf-C2H2 92	Ank 50	857	67.1%	2.797	101
5	Cadherin 334	zf-C2H2 134	Cadherin_2 55	951	68.2%	3.121	334
6	zf-C2H2 162	fn3 62	zf-C2H2_6 57	1057	65.8%	2.807	162
7	zf-C2H2 324	zf-C2H2_6 91	V-set 58	957	66.1%	2.883	324
8	zf-C2H2 171	Sushi 56	Ank 52	787	67.9%	2.663	171
9	zf-C2H2 144	I-set 57	Ig_3 57	856	67.4%	2.974	144
10	zf-C2H2 103	Ank 72	Ank_3 44	854	66%	2.56	103
11	7tm_1 196	7tm_4 180	I-set 70	1136	65.8%	3.109	196
12	zf-C2H2 127	WD40 48	Ank 40	1100	65.8%	2.576	127
13	Collagen 28	Cadherin 21	zf-C2H2 19	475	69.7%	2.027	28
14	V-set 96	WD40 41	7tm_1 40	736	67.1%	2.455	96
15	zf-C2H2 57	EGF_CA 46	hEGF 41	727	68.1%	2.611	57
16	zf-C2H2 292	zf-C2H2_6 71	WD40 44	914	67.2%	2.534	292
17	zf-C2H2 107	Keratin_B2_2 73	WD40 42	1151	65.2%	2.487	107
18	zf-C2H2 75	Cadherin 45	Laminin_EGF 30	370	69.5%	2.473	75
19	zf-C2H2 2805	zf-C2H2_6 910	KRAB 228	1114	65.4%	6.684	2805
20	zf-C2H2 107	zf-C2H2_6 25	Laminin_EGF 21	674	72.6%	2.068	107
21	Keratin_B2_2 47	I-set 17	ig 16	291	69.4%	2.131	47
22	V-set 39	zf-C2H2 27	EGF_CA 16	590	68.1%	2.171	39
X	zf-C2H2 136	Collagen 32	MAGE 32	781	59.3%	2.77	136
all	zf-C2H2 5493	zf-C2H2_6 1609	I-set 900	6839	44.3%	9.109	5493

3.3 Chromosome 19 as outlier

Chromosome 19 (Chr19) is already known to have the highest gene density among human chromosomes, more gene families, high GC content, a higher density of repetitive sequences (Grimwood et al., 2004), and extreme divergence from the mouse genome (Castresana, 2002) while being conserved within nonhuman primates (Harris et al., 2020).

We examine several chromosome-level statistics to see which quantities make chromosome 19 an outlier. Chr19 has the highest GC content of all chromosomes, which might be related to its higher gene density, which itself is a product of widespread DNA/gene duplication.

If CRCPD statistics are examined directly, we find that the proportion of unique CRCPDs in Chr19 does not stand out (see Table 1). The number of CRCPD types on Chr19 is large, but not the largest. However, Chr19 has the highest average copy number per CRCPD, and also has the largest copy number for a single CRCPD: the zinc finger C2H2 domain appears 2805 times, which is an order of magnitude larger than the maximum copy number in the other chromosomes.
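The per-chromosome statistics discussed here can be computed from a plain list of domain names, one entry per token. The sketch below is an illustration in Python under one reading of the Table 1 columns: per(CN=1) is taken as the percentage of types that occur exactly once, and mean(CN) as tokens divided by types. That reading, the function name, and the toy list are assumptions, not the authors' code.

```python
from collections import Counter

def copy_number_summary(tokens):
    """Summarise copy numbers (CN) for one genomic unit, where `tokens`
    is a list of Pfam domain names, one entry per CRCPD token."""
    cn = Counter(tokens)
    num_type = len(cn)
    singletons = sum(1 for c in cn.values() if c == 1)
    return {
        "top3": cn.most_common(3),
        "num_type": num_type,
        "pct_cn1": 100.0 * singletons / num_type,
        "mean_cn": len(tokens) / num_type,
        "max_cn": max(cn.values()),
    }

# Toy token list (hypothetical counts, not taken from Table 1).
tokens = (["zf-C2H2"] * 6 + ["KRAB"] * 3 + ["WD40"] * 2
          + ["Sushi", "Ank", "fn3"])
summary = copy_number_summary(tokens)
```

Under this reading, num_type multiplied by mean(CN) recovers the token count. For the whole-genome row, 6839 × 9.109 ≈ 62,296, which agrees up to rounding with the 62299 pooled tokens used in subsection 4.2.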

3.4 Extension to half of the chromosome arms

The x-axis of Fig.1 spans only about one decade (the ratio of the maximum to the minimum token count is 12 for chromosomes and 17 for chromosome arms). In order to increase the range of the x-axis, we split each chromosome arm into two equally sized regions. The type-token plot with these more local data added is shown in Fig.2.

Figure 2 shows that with more data at the low end of the x-axis (the largest token count is 59 times the smallest token count), Heaps’ law still holds, with the fitted exponent $\alpha=0.75$. Removing the outliers due to Chr19 gives $\alpha=0.8$. Both of these slopes are larger than those in Fig.1.

Visual inspection of Fig.2 seems to show a slight tendency for the points to curve downward, which might explain the increase of the fitted exponent. In fact, in the original version of Heaps’ law in Eq.(1), $C$ should be 1, as the line should go through the point $x=y=1$ (when there is only one token, it must be unique). All our fitted $C$ values are larger than 1, meaning that in log-log scale the type-token scatter plot will eventually deviate from a straight line towards the origin. We will return to this issue in subsection 4.2.
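A short worked step, offered as an illustration rather than as part of the original analysis, makes the constraint on $C$ explicit. Since the number of types can never exceed the number of tokens ($y\leq x$), a fitted line with $C>1$ and $\alpha<1$ can only be valid above a crossover token count $x^{*}$:

$$Cx^{\alpha}\leq x\;\Longleftrightarrow\;x^{1-\alpha}\geq C\;\Longleftrightarrow\;x\geq x^{*}=C^{1/(1-\alpha)}$$

For the whole-chromosome fit of subsection 3.1 ($\alpha=0.658$, $C=5.12$), this gives $x^{*}=5.12^{1/0.342}\approx 1.2\times 10^{2}$ tokens. Below roughly that count, the fitted line would predict more types than tokens, so the data must bend below it there.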


4 Robustness of the results

4.1 Random shuffle of CRCPDs

In order to check whether the choice of sampling units (chromosomes, chromosome arms, or half arms) plays any role in the observed power-law trend in Figs.1,2, we carried out two experiments. In the first, we shuffle the CRCPDs (tokens) genome-wide once, while keeping the chromosomal locations intact, and then repeat the same CRCPD type count in the same chromosomes, chromosome arms, and half arms. In the second, described in the next subsection, we randomly sample CRCPDs without replacement and plot the CRCPD type count against the number of CRCPD tokens sampled.

Fig.3 shows the type-token relationship when CRCPDs are shuffled across the whole genome. The black dots are identical to those in Fig.2: some dots represent chromosomes, others represent chromosome arms or half arms. The red dots are the corresponding type counts in the same genomic segments (chromosomes, arms, half arms) after the shuffling has taken place.

It can be seen from Fig.3 that after shuffling, each genomic unit tends to have more CRCPD types than in the original data (red dots lying above black dots). This means that CRCPD redundancy tends to be locally clustered. However, the shuffling of CRCPDs does not change the fact that type and token are in a power-law relationship. The regression slope is 0.747 for the shuffled data, very close to the value of 0.747 (or 0.8 when Chr19 is removed) for the original data.
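The shuffling experiment can be sketched as follows. It is a minimal Python illustration, not the authors' R code: the genomic positions and their unit labels stay fixed while the domain names are permuted genome-wide, so the token count of every unit is unchanged and only its type count can move. Function names and toy data are hypothetical.

```python
import random

def types_per_unit(unit_labels, names):
    """Count distinct domain names (types) within each genomic unit."""
    seen = {}
    for unit, name in zip(unit_labels, names):
        seen.setdefault(unit, set()).add(name)
    return {unit: len(s) for unit, s in seen.items()}

def shuffle_names(names, seed=0):
    """Permute domain names across all positions genome-wide, leaving
    the positions (and therefore the per-unit token counts) untouched."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled

# Toy genome: unit u1 holds a locally clustered, highly repeated domain.
units = ["u1"] * 6 + ["u2"] * 6
names = ["A"] * 6 + ["B", "C", "D", "E", "F", "G"]

before = types_per_unit(units, names)
shuffled = shuffle_names(names, seed=0)
after = types_per_unit(units, shuffled)
```

Before shuffling, the clustered unit u1 contains a single type. After shuffling, its copies of "A" are spread over both units, which is the mechanism by which local clustering lowers the type counts of the original data.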


4.2 Random sampling from CRCPD pool and a systematic deviation from the power-law

In the second approach to testing the robustness of our result, we create artificial sampling units by randomly picking CRCPD tokens from the genome. This way of sampling tokens frees us from the restriction to chromosomes or any other genomic regions. Also, the range of the x-axis in the type-token plot can be greatly expanded. We pool CRCPDs from all human chromosomes (62299 tokens). Then we randomly sample x tokens and check how many of them are unique (y types). A type-token plot from this random sampling is shown in Fig.4(A) in log-log scale. Note that the x-range in Fig.4(A) is much wider than in Figs.1,2.

If Heaps’ law held exactly, we would expect a straight line in the log-log plot in Fig.4. The linear regression (grey line) captures the overall trend (the exponent 0.725 is less than one). However, a systematic deviation from the straight line, with a concave shape visible at both ends of the line, can be easily seen. When we add a quadratic term, the regression fits the data perfectly (with a negative coefficient for the quadratic term). The AIC of the quadratic regression, $-244.79$, is much lower (better) than that of the linear regression ($-93.08$), confirming the visual impression. This “log-log concavity” was previously reported (Bernhardsson et al., 2009; Lü et al., 2013; Font-Clos and Corral, 2015), and our DNA word result is fully consistent with it. For this reason, a better term for what is seen in Figs.1,2,6,4(A) could be “range-limited Heaps’ law”.

Since quadratic curves fit the type-token plot in Fig.4(A) well, the coefficient of the quadratic term is a reliable measure of the concavity. We investigate the impact of limited dictionary size on the concavity, or deviation from Heaps’ law, by removing rare CRCPDs. Fig.4(B) shows the fitted coefficient of the quadratic term in the following situations: half of the singleton CRCPD types (those that appear in the genome only once) are removed; all singleton CRCPD types are removed; all singletons plus half of the doublets (those CRCPD types that appear in the genome only twice) are removed; and all singletons and all doublets are removed.
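The dictionary truncation used for Fig.4(B) can be sketched as below, in Python and under stated assumptions: the types to remove are chosen at random among those with the given genome-wide copy number, and all tokens of a removed type are dropped. How the half-removed subsets were actually chosen is not stated in the text, so this selection rule, the function name, and the toy pool are hypothetical.

```python
import random
from collections import Counter

def drop_rare_types(tokens, copy_number, fraction, seed=0):
    """Remove a given fraction of the types whose genome-wide copy
    number equals `copy_number`, together with all of their tokens."""
    counts = Counter(tokens)
    rare = sorted(name for name, c in counts.items() if c == copy_number)
    rng = random.Random(seed)
    n_drop = int(round(fraction * len(rare)))
    dropped = set(rng.sample(rare, n_drop))  # assumption: random choice
    return [t for t in tokens if t not in dropped]

# Toy pool: one frequent type, two doublets, four singletons.
pool = ["A"] * 5 + ["B"] * 2 + ["C"] * 2 + ["s1", "s2", "s3", "s4"]

half_singletons = drop_rare_types(pool, 1, 0.5)
no_singletons = drop_rare_types(pool, 1, 1.0)
no_singletons_doublets = drop_rare_types(no_singletons, 2, 1.0)
```

Each truncated pool can then be passed through the same random token sampling and quadratic fit, giving one point of the kind plotted in Fig.4(B).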

Each point in Fig.4(B) represents a random sampling process similar to that in Fig.4(A), i.e., a sequence of token samplings followed by type counting, with the type source (dictionary) cut short. We ran the random sampling multiple times even for the same dictionary size, because of the variation inherent in random sampling, and we also changed the upper limit of the x-axis to accommodate the reduction of dictionary size. Even with these variations, the trend is clear: the type-token plot becomes more curved as more rare words (singleton and doublet protein domains) are removed.

The random sampling of tokens can also be carried out on a per-chromosome basis. Fig.5 shows the type-token plot for the 23 individual chromosomes. For each chromosome, only the protein domains observed within that chromosome are pooled, then randomly sampled without replacement, and the distinct types are counted. Chromosome 19 again stands out as the chromosome with the lowest diversity and a smaller number of types. The plateau reached for each chromosome indicates that the dictionary size has been reached. This finite size effect is different from the slight concavity of the type-token curve in Fig.4.

The lines/curves for individual chromosomes in Fig.5 behave similarly in terms of slope before reaching the plateau. The y-intercept is very different for chromosome 19, consistent with Fig.1 and Fig.2, while those of the other chromosomes are comparable. The shape of the per-chromosome type-token scatter plots obtained by random sampling, which is roughly a power law with a slight concave curving followed by a plateau, is a typical example of a finite size effect. Due to the nature of the generated data, the number of tokens is not limited whereas the number of types is finite, leading to plateaus.


5 Results for other animal genomes

5.1 Sampling from chromosomal units

Using the same UCSC Genome Browser, we downloaded the Pfam domain locations for the mouse genome (GRCm39/mm39, Jun 2020, all 21 chromosomes), UniProt (The UniProt Consortium, 2012) domain information for chicken (GRCg6a/galGal6, 2018, missing chromosomes 29, 34-39), zebrafish (GRCz11/danRer11, 2017, all 25 chromosomes), and Drosophila melanogaster (BDGP release 6, 2014, 2L, 2R, 3L, 3R, 4, X and Y chromosomes/arms).

Figure 6 shows the type-token scatter plots in log-log scale, with points representing either a chromosome (black) or half of a chromosome (blue). Half chromosomes are used to increase the sample size and to widen the range covered by the x-axis.

The mouse genome exhibits behavior similar to that of the human genome. The variance explained by the power law is $R^{2}=0.887$, the regression slope is $\alpha=0.688$, and $C=3.8$. The zebrafish genome has two chromosome outliers, chromosomes 4 and 22, as well as half-chromosome outliers from chromosome 4. Removing these four points leads to a power-law regression with $R^{2}=0.816$, slope=0.643, and $C=3.62$. The chicken genome has two outliers at the chromosome level, Chr16 and Chr31, and four half-chromosome outliers derived from these two chromosomes. After removing these six points, the regression in log-log scale has $R^{2}=0.937$, slope=0.782, and $C=1.56$. Finally, for the Drosophila genome, $R^{2}=0.993$, slope=0.865, and $C=0.81$.

To summarize the results from these genomes, for which a protein domain track is available: Heaps’ law is generally true with exponent $\alpha<1$, despite potential data quality issues such as those of the UniProt domain information, the limited range of the x-axis, and the limited number of sample points. We also notice that when the low end of the x-axis of the scatter plot is close to the origin (e.g. Drosophila), the slope is relatively larger and closer to 1. Another observation is the difference in vocabulary sizes among genomes, with mouse having more protein domain types than chicken, zebrafish and Drosophila. Word size differences between genomes were also observed in (Gatherer, 2007), though for a different definition of words (over-represented protein substrings).


5.2 Random sampling from total CRCPD pool

Similar to subsection 4.2, we may create our own sampling units by randomly sampling CRCPD tokens from a genome. This step is even more important for the other animal genomes than for the human genome, because the type-token plots in Fig.6 are noisier and more limited in x-axis range.

Figure 7 shows the number of CRCPD types within a set of randomly sampled CRCPD tokens (without replacement), as a function of the number of tokens, for the four genomes (mouse, Drosophila, chicken, and zebrafish). The trends of these type-token plots are very similar: quadratic regression curves fit the data perfectly, and within a limited range these quadratic curves can be approximated by linear segments (thus Heaps’ law).

The quadratic nature of the type-token plot when the x-axis is extended reveals a weakness of using linear regression to fit the data in Fig.6. The slight differences between the linear coefficients of the different genomes in Fig.6 are not a reliable index for ranking genomes. Figure 7, on the other hand, shows a clear order, from zebrafish, to chicken and Drosophila, then to mouse, following the trend of increasing vocabulary size (number of distinct CRCPD types). The human data are also shown in Fig.7 and are essentially identical to those of mouse.


6 Discussion

Heaps’ (or Herdan-Heaps) law is one of the classic linguistic “laws” (Altmann and Gerlach, 2016; Hernández-Fernández et al., 2019). The word “law” in this context is very different from that in other fields such as physics (e.g. Newton’s laws). Linguistic laws are empirical, not supposed to be exact, and only describe a trend. As reported by several authors and confirmed in our own DNA word data, Heaps’ law, i.e. a power-law relationship (or linear relation in log-log scale), is unlikely to hold over the whole range of text lengths (number of tokens) (Bernhardsson et al., 2009; Lü et al., 2013; Font-Clos and Corral, 2015). Even for human languages (e.g., some non-Indo-European languages), #type tends to flatten out away from the power law at the large #token end (Lü et al., 2013). At the low #token end, the fact that the fitted parameter $C>1$ (or $C^{\prime}>0$) in our data indicates that the power law cannot be extrapolated all the way to the origin.

Since the genome size is limited, the total number of protein-domain-coding loci is limited; therefore there is an upper limit on the #token axis as well as on the #type axis. Similarly, because a large proportion ($\sim$50%) of the human genome lies in non-coding intergenic regions without any CRCPDs, it is not practical to use a short genomic region, as it may have zero #token and #type counts. This leads to a lower limit on the #token axis. Checking the type-token relation in the human genome therefore has to be range-limited. On the other hand, our simulation in Fig.4 shows that beyond a certain range, one should no longer expect a power-law relationship between type and token, and Heaps’ law will break down.

There are several publications relating the exponents in Zipf’s law and Heaps’ law asymptotically, provided that the token frequency vs. rank and the type vs. token relations are indeed inverse power-law and power-law, respectively (Baeza-Yates and Navarro, 2000). There are also papers deriving the asymptotic type-token relation in the model of random sampling from a power-law distributed word pool (Eliazar, 2011; Gerlach and Altmann, 2013; Boytsov, 2017), in complicated mathematical expressions, often in the form of an infinite summation. With finite dictionary size, the infinite summation becomes a finite summation, but still no simple mathematical expression has been proposed. Here, from a data fitting perspective, the type-token relation can be expressed by a quadratic equation: $\log(y)=c^{\prime}+\alpha\log(x)-\beta(\log(x))^{2}$ with $0<\alpha<1$ and $\beta>0$. Using an extra term and extra parameter (Li et al., 2010; Li and Miramontes, 2011), in particular a quadratic/parabolic term (Frappat et al., 2003), to correct deviation from a straight line is a common practice.
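The quadratic fit can be illustrated with a minimal sketch. It draws tokens at random from a synthetic, finite Zipf-distributed pool (not the Pfam/CRCPD data), records the number of distinct types at log-spaced token counts, and compares a linear with a quadratic least-squares fit in log-log scale. The function names, the pool size of 2000 types, the Zipf exponent of 1, the random seed, and the use of base-10 logarithms are all assumptions chosen for illustration.

```python
import math
import random


def simulate_type_token(n_types=2000, zipf_exponent=1.0,
                        max_tokens=100_000, n_points=20, seed=0):
    """Draw tokens from a finite Zipf-like pool; return (#token, #type) lists."""
    rng = random.Random(seed)
    # Type frequency proportional to rank**(-zipf_exponent) (assumed pool).
    weights = [rank ** -zipf_exponent for rank in range(1, n_types + 1)]
    tokens = rng.choices(range(n_types), weights=weights, k=max_tokens)

    # Log-spaced checkpoints between 10 and max_tokens tokens.
    span = math.log10(max_tokens) - 1.0
    checkpoints = sorted({
        min(max_tokens, round(10 ** (1.0 + span * i / (n_points - 1))))
        for i in range(n_points)
    })

    wanted, seen, n_seen = set(checkpoints), set(), {}
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if n in wanted:
            n_seen[n] = len(seen)  # distinct types among the first n tokens
    return checkpoints, [n_seen[n] for n in checkpoints]


def fit_loglog(x, y, degree):
    """Least-squares polynomial fit of log10(y) on log10(x).

    Returns (coefficients in ascending order, sum of squared residuals).
    For degree 2 the coefficients are (c', alpha, -beta).
    """
    lx = [math.log10(v) for v in x]
    ly = [math.log10(v) for v in y]
    n = degree + 1
    # Normal equations A c = b.
    a = [[sum(u ** (i + j) for u in lx) for j in range(n)] for i in range(n)]
    b = [sum(v * u ** i for u, v in zip(lx, ly)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (b[i] - s) / a[i][i]
    sse = sum((v - sum(c * u ** k for k, c in enumerate(coef))) ** 2
              for u, v in zip(lx, ly))
    return coef, sse


if __name__ == "__main__":
    x, y = simulate_type_token()
    (c_lin, a_lin), sse_lin = fit_loglog(x, y, 1)
    (c_quad, alpha, neg_beta), sse_quad = fit_loglog(x, y, 2)
    print(f"linear:    alpha={a_lin:.3f} SSE={sse_lin:.4f}")
    print(f"quadratic: alpha={alpha:.3f} beta={-neg_beta:.3f} SSE={sse_quad:.4f}")
```

In this toy setting the number of types saturates as the sample approaches the full dictionary, so the fitted quadratic coefficient is negative (that is, $\beta>0$) and the quadratic fit leaves a smaller residual than the straight line. The fitted values depend on the assumed pool and should not be read as estimates for any genome.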

Studies along the same lines as the present investigation include (Nasir et al., 2017; Caetano-Anollés, 2021). These works focus on the phylogenetics of different species, and the words used are protein fold superfamilies (FSF). FSF information was provided by the SCOP (structural classification of proteins) database (Murzin et al., 1995; Andreeva et al., 2020). Even though SCOP domain superfamilies and Pfam domains are somewhat similar, the sample points in a type-token plot in (Nasir et al., 2017; Caetano-Anollés, 2021) are different from those in the current study. In (Nasir et al., 2017; Caetano-Anollés, 2021), each point represents a species, whereas in our study a point is a chromosome, a chromosome arm, or another unit smaller than the whole genome. When all species are represented in one plot in (Nasir et al., 2017; Caetano-Anollés, 2021), including viruses, archaea, bacteria, and eukaryotes, the scatter of the points (genomes) in log-log scale does not follow one single straight line, but a curve. Instead of treating the curve as consisting of multiple piecewise linear regimes (Nasir et al., 2017; Caetano-Anollés, 2021), another possibility is to use a quadratic fitting function; admittedly, this does not explain the deviation from the power-law trend, and it ignores some fundamental differences in genome evolution in different domains.

One key lesson from Heaps’ law, that the rate of new knowledge gain does not increase linearly with the effort, is very relevant to pan-genome studies (Tettelin et al., 2008): the number of new genes discovered does not increase linearly with the number of genomes sequenced. A different kind of diminishing return, concerning the alignment of next-generation-sequencing reads when the read length is increased, is discussed in (Li et al., 2014; Stephens and Iyer, 2018).

Besides the above work on protein fold superfamilies in pan-genomes, Heaps’ law, unlike Zipf’s law (Sheinman et al., 2016), rarely appears in the literature concerning genomics. In (Gatherer, 2007), when over-represented peptide substrings (over-represented k-mers with k in a range of values) are considered as words, linear regression is used to fit the type-token scatter plots. No attempt was made to use the power-law function in (Gatherer, 2007). In (Mukhopadhyay et al., 2006), the fixed-length (k=3) codons are considered as words. This definition of words would easily reach the vocabulary limit when the text length is increased. Therefore, a different type-token relation is to be expected.

We have confirmed the result of some previous works that if tokens are randomly sampled from a dictionary (type pool) whose type frequency (tokens per type) follows Zipf’s law or some modification of it, the resulting type-token plots, though they may look like a power-law/Heaps’ law, are actually curved (Bernhardsson et al., 2009; Lü et al., 2013; Font-Clos and Corral, 2015). By simulation, we also numerically determined the impact of finite dictionary size by removing the rare CRCPD types from consideration: the type-token plot becomes more curved as the dictionary size is reduced. The previous attempts to derive the type-token relation, with finite or infinite dictionary size, do not present the result in a closed-form analytic expression (Eliazar, 2011; Lü et al., 2013; Tunnicliffe and Hunter, 2022). Our quadratic regression provides a simple formula for an empirical description of the systematic deviation from Heaps’ law.

The definition of “type” or “dictionary word”, in our case Pfam domains, is taken as given by the database, and synonyms are not merged. For example, there are 13 protein domain names containing the prefix zf-C2H2, including the top ranking zf-C2H2 (5493 appearances in the genome) and the second ranking zf-C2H2_6 (1609 appearances in the genome). The reason for not combining potential synonyms is that their presence is a feature of a “dictionary”, and Heaps’ law in fact partially reflects this lexical diversity. If one were to combine CRCPDs that are “similar” (up to a certain degree), the type-token plot and Heaps’ law in human languages should also be modified accordingly.

As argued in (Lynch and Conery, 2003), a consequence of smaller population size is that selection pressure is much less effective in removing individuals without a perfect fit for the given environment. As nonadaptive forces dominated, genomes became less compact, with more redundancies, and inefficiencies were tolerated. On the other hand, this near-neutral random walk in Wright’s landscape allows a species to reach a new adaptive peak, potentially higher than the previous one (Wagner, 2019). Microorganisms have populations several orders of magnitude larger than animals. The higher level of genome redundancy and complexity in animals, versus the lower level of genome complexity in microorganisms, is then expected. For a genome without redundancy, the number of tokens is roughly equal to the number of types, and the Heaps exponent is one. With redundancy, the exponent is expected to be less than 1.

It is suggested in (Caetano-Anollés et al., 2017; Caetano-Anollés, 2021) that viral, archaeal and bacterial proteomes have a higher level of vocabulary growth than eukarya, as indicated by the higher value of the $\alpha$ exponents (around 0.8). The $\alpha$ exponent for eukaryotic genomes in (Caetano-Anollés, 2021) is only around 0.1, in contrast with our exponent $\alpha\sim 0.7$ within one genome. It is tempting to use the slope in linear regression to rank different animal genomes by their level of duplication of protein domains. However, the limited x-range makes the estimation of the slope less reliable. Also, when the x-range is expanded, the linear (i.e., power-law when x and y are not in log-scale) trend becomes quadratic, and there are three parameters to characterize the type-token function, not just one. The similarity of all three fitting parameters for the animal genomes in Fig.7 makes them more similar than different.

As a quantitative relationship, a Heaps’ law-like formula can provide a quick order-of-magnitude estimate of human genome parameters (Li, 2011b). For example, if there is one copy of CRCPD per gene, 20000 genes in the human genome, and the Heaps’ law exponent is 0.7, then $20000^{0.7}\approx 1000$ protein domain types are estimated. With the factor C in Eq.1, we expect the number to be further multiplied. For example, if C=5, we expect 5000 protein domains (the actual number of Pfam domain types is 6839). Even though these numbers are not exact, they provide a ballpark estimate.
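The back-of-envelope estimate above can be reproduced in a few lines. This is a minimal sketch assuming the Heaps form $y=Cx^{\alpha}$; the function name `heaps_estimate` and the variable names are hypothetical, and the input numbers are the ones quoted in the text.

```python
def heaps_estimate(n_tokens, alpha, prefactor=1.0):
    """Number of types predicted by y = prefactor * n_tokens**alpha."""
    return prefactor * n_tokens ** alpha


n_genes = 20_000   # one CRCPD copy per gene, as assumed in the text
alpha = 0.7        # Heaps exponent quoted in the text

base = heaps_estimate(n_genes, alpha)          # on the order of 1000
with_c5 = heaps_estimate(n_genes, alpha, 5.0)  # five times the base estimate

# Prefactor that would reproduce the 6839 Pfam domain types exactly,
# under the same one-domain-per-gene assumption.
implied_prefactor = 6839 / base

print(f"C=1: {base:.0f} types")
print(f"C=5: {with_c5:.0f} types")
print(f"implied C for 6839 types: {implied_prefactor:.2f}")
```

Setting `alpha=1.0` with `prefactor=1.0` returns the number of tokens itself, which corresponds to the no-redundancy limit mentioned earlier in the discussion.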

Despite the deviation from a perfect power-law for the type-token plot, or, in an alternative wording, the fact that the power-law Heaps’ law is only technically correct in a limited range, there are other important lessons in this study from the linguistic perspective. Just as in human languages, with hundreds of new English words being created annually, albeit some of them synonyms of existing words, new protein domains have been generated during evolution. Even if the new protein domains may not be completely new, being to various degrees similar to the old ones, they may not be that different from the new synonymous words in human languages. Some of the seemingly non-novel new protein domains may continue to evolve and eventually lead to a new meaning and a new structure.

Acknowledgment

We would like to thank Patrick Villanueva for participating in the initial stage of the project, and Tom MacCarthy and Roman Samulyak for useful suggestions.

References

H Akaike (1974), A new look at the statistical model identification, IEEE Trans. Automatic Control, 19:716-723.
EG Altmann and M Gerlach (2016), Statistical laws in linguistics, in Creativity and Universality in Language, eds. M. Degli Esposti, E. Altmann, F. Pachet, pp.7-26 (Springer, Switzerland).
A Andreeva, E Kulesha, J Gough, AG Murzin (2020), The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures, Nucl. Acids Res., 48:D376-D382.
A Apostolico, ME Bock, S Lonardi (2003), Monotony of surprise and large-scale quest for unusual words, J. Comp. Biol., 10:283–311.
RA Baeza-Yates and G Navarro (2000), Block addressing indices for approximate text retrieval, J. Am. Soc. Info. Sci., 51:69–82.
S Bernhardsson, LEC de Rocha, P Minnhagen (2009), The meta book and size-dependent properties of written language, New J. Phys., 11:123015.
L Boytsov (2017), A simple derivation of the Heap’s law from the generalized Zipf’s law, arXiv preprint, 1711.03066.
T Brants, AC Popat, P Xu, FJ Och, J Dean (2007), Large language models in machine translation, in Proc. 2007 Joint Conf. Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), ed. J Eisner, pp. 858–867 (Asso. Comp. Linguistics). URL: https://aclanthology.org/D07-1/
V Brendel, JS Beckmann, EN Trifonov (1986), Linguistics of nucleotide sequences: morphology and comparison of vocabularies, J. Biomol. Struc. Dyn., 4:11-21.
DW Buchan and DT Jones (2020), Learning a functional grammar of protein domains using natural language word embedding techniques, Proteins, 88:616-624.
HJ Bussemaker, H Li, ED Siggia (2000), Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis, Proc. Natl. Acad. Sci., 97:10096-10100.
G Caetano-Anollés (2021), The compressed vocabulary of microbial life, Front. Microbiol., 12:655990.
G Caetano-Anollés, BF Minhas, F Aziz, F Mughal, K Shahzad, G Tal, JE Mittenthal, D Caetano-Anollés, I Koç, A Nasir, K Caetano-Anollés, KM Kim (2017), The compressed vocabulary of the proteins of Archaea, in Biocommunication of Archaea, ed. G Witzany, pp.147-174 (Springer, Switzerland).
J Castresana (2002), Genes on human chromosome 19 show extreme divergence from the mouse orthologs and a high GC content, Nucl. Acids Res., 30:1751-1756.
R Chikhi and P Medvedev (2013), Informed and automated k-mer size selection for genome assembly, Bioinformatics, 30:31-37.
J Devlin, MW Chang, K Lee, K Toutanova (2018), BERT: pre-training of deep bidirectional transformers for language understanding, arXiv preprint, doi: 10.48550/arXiv.1810.04805
S Dong and DB Searls (1994), Gene structure prediction by linguistic methods, Genomics, 23:540-551.
E Dotan, G Jaschek, T Pupko, Y Belinkov (2024), Effect of tokenization on transformers for biological sequences, Bioinformatics, 40:btae196.
L Egghe (2007), Untangling Herdan’s law and Heaps’ law: Mathematical and informetric arguments, J. Am. Soc. Info. Sci. Tech., 58:702-709.
I Eliazar (2011), The growth statistics of Zipfian ensembles: beyond Heaps’ law, Physica A, 390:3189-3203.
N Ferruz, S Schmidt, B Höcker (2022), ProtGPT2 is a deep unsupervised language model for protein design, Nature Comm., 13:4348.
F Font-Clos and Á Corral (2015), Log-Log convexity of type-token growth in Zipf’s systems, Phys. Rev. Lett., 114:238701.
L Frappat, C Minichini, A Sciarrino, P Sorba (2003), Universality and Shannon entropy of codon usage, Phys. Rev. E, 68:061910.
C Frontali and E Pizzi (1999), Similarity in oligonucleotide usage in introns and intergenic regions contributes to long-range correlation in the Caenorhabditis elegans genome, Gene, 232:87-95.
K Gao and J Miller (2011), Algebraic distribution of segmental duplication lengths in whole-genome sequence self-alignments, PLoS ONE, 6:e18464.
D Gatherer (2007), Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences, Bioinf. Biol. Insights, 1:101-126.
M Gerlach and EG Altmann (2013), Stochastic model for the vocabulary growth in natural languages, Phys. Rev. X, 3:021006.
M Gimona (2006), Protein linguistics – a grammar for modular protein assembly? Nature Rev. Mol. Cell Biol., 7:68-73.
J Grimwood, LA Gordon, A Olsen, A Terry, J Schmutz, J Lamerdin, U Hellsten, D Goodstein, O Couronne, M Tran-Gyamfi, et al. (2004), The DNA sequence and biology of human chromosome 19, Nature, 428:529-535.
RA Harris, M Raveendran, KC Worley, J Rogers (2020), Unusual sequence characteristics of human chromosome 19 are conserved across 11 nonhuman primates, BMC Evol. Biol., 20:33.
HS Heaps (1978), Information Retrieval: Computational and Theoretical Aspects (Academic Press, New York, USA).
G Herdan (1960), Type-token Mathematics: A Textbook of Mathematical Linguistics (Mouton, The Hague, Netherlands).
A Hernández-Fernández, IG Torre, JM Garrido, L Lacasa (2019), Linguistic laws in speech: the case of Catalan and Spanish, Entropy, 21:1153.
J Hoopes, ed. (1991), Peirce on Signs: Writings on Semiotic by Charles Sanders Peirce (University of North Carolina Press, Chapel Hill, NC, USA).
I Ionita-Laza, C Lange, NM Laird (2009), Estimating the number of unseen variants in the human genome, Proc. Natl. Acad. Sci., 106:5008-5013.
I Ispolatov, PL Krapivsky, A Yuryev (2005), Duplication-divergence model of protein interaction network, Phys. Rev. E, 71:061911.
LE Kay (2000), Who Wrote the Book of Life? A History of the Genetic Code (Stanford University Press, Stanford, CA, USA).
AK Konopka and C Martindale (1995), Noncoding DNA, Zipf’s law, and language (letter), Science, 268:5212.
EV Koonin, YI Wolf, GP Karev (2002), The structure of the protein universe and genome evolution, Nature, 420:218-223.
S Koren, BP Walenz, K Berlin, JR Miller, NH Bergman, AM Phillippy (2017), Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation, Genome Res., 27:722-736.
W Li (1991), Expansion-modification systems: A model for spatial 1/f spectra, Phys. Rev. A, 43:5240-5260.
W Li (2002), Zipf’s law everywhere, Glottometrics, 5:14-21.
W Li (2011a), Menzerath’s law at the gene-exon level in the human genome, Complexity, 17:49-53.
W Li (2011b), On parameters of the human genome, J. Theo. Biol., 288:92-104.
W Li, O Fontanelli, P Miramontes (2016), Size distribution of function-based human gene sets and the split–merge model, Royal Soc. Open Sci., 3:160275.
W Li, J Freudenberg, J Freudenberg (2019), Alignment-free approaches for predicting novel Nuclear Mitochondrial Segments (NUMTs) in the human genome, Gene, 691:141-152.
W Li, J Freudenberg, P Miramontes (2014), Diminishing return for increased Mappability with longer sequencing reads: implications of the k-mer distributions in the human genome, BMC Bioinf., 15:2.
W Li and DR Nyholt (2001), Marker selection by AIC and BIC, Genet. Epid., 21(suppl 1):S272-S277.
W Li and P Miramontes (2011), Fitting ranked English and Spanish letter frequency distribution in US and Mexican presidential speeches, J. Quant. Linguistics, 18:359-380.
W Li, P Miramontes, G Cocho (2010), Fitting ranked linguistic data with two-parameter functions, Entropy, 12:1743-1764.
L Lü, ZK Zhang, T Zhou (2013), Deviation of Zipf’s and Heaps’ Laws in human languages with limited dictionary sizes, Sci. Rep., 3:1082.
NM Luscombe, J Qian, Z Zhang, T Johnson, M Gerstein (2002), The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties, Genome Biol., 3:research0040.
M Lynch and JS Conery (2003), The origins of genome complexity, Science, 302:1401-1404.
A Madani, B Krause, ER Greene, S Subramanian, BP Mohr, JM Holton, JL Olmos Jr., C Xiong, ZZ Sun, R Socher, JS Fraser, N Naik (2023), Large language models generate functional protein sequences across diverse families, Nature Biotech., 41:1099-1106.
RN Mantegna, SV Buldyrev, AL Goldberger, S Havlin, CK Peng, M Simons, HE Stanley (1994), Linguistic features of noncoding DNA sequences, Phys. Rev. Lett., 73:3169-3172.
D Medini, C Donati, R Rappuoli, H Tettelin (2020), The pangenome: a data-driven discovery in biology, in The Pangenome, eds. H Tettelin and D Medini, pp.3-20 (Springer, Switzerland).
P Menzerath (1928), Über einige phonetische probleme, in Actes du premier Congres International de Linguistes, pp. 104–105 (Sijthoff, Leiden, Netherlands).
J Miller, AD McLachlan, A Klug (1985), Repetitive zinc-binding domains in the protein transcription factor IIIA from Xenopus oocytes, EMBO J., 4:1609-1614.
J Mistry, S Chuguransky, L Williams, M Qureshi, GA Salazar, ELL Sonnhammer, SCE Tosatto, L Paladin, S Raj, LJ Richardson, RD Finn, A Bateman (2021), Pfam: The protein families database in 2021, Nucl. Acids Res., 49:D412–D419.
H Moghaddasi, K Khalifeh, AH Darooneh (2017), Distinguishing functional DNA words; a method for measuring clustering levels, Sci. Rep., 7:41543.
I Mukhopadhyay, A Som, S Sahoo (2006), Word organization in coding DNA: A mathematical model, Theory in Biosci., 125:1-17.
A Müller, RM MacCallum, MJE Sternberg (2002), Structural characterization of the human proteome, Genome Res., 12:1625-1641.
AG Murzin, SE Brenner, T Hubbard, C Chothia (1995), SCOP: A structural classification of proteins database for the investigation of sequences and structures, J. Mol. Biol., 247:536-540.
A Nasir, KM Kim, G Caetano-Anollés (2017), Phylogenetic tracings of proteome size support the gradual accretion of protein structural domains and the early origin of viruses from primordial cells, Front. Microbiol., 8:1178.
SC Nelson, JH Yu, L Ceccarelli (2015), How metaphors about the genome constrain CRISPR metaphors: separating the “Text” from its “Editor”, Am. J. Bioethics, 15:60-62.
MEJ Newman (2005), Power laws, Pareto distributions and Zipf’s law, Contemporary Phys., 46:323-351.
E Nijkamp, JA Ruffolo, EN Weinstein, N Naik, A Madani (2023), ProGen2: Exploring the boundaries of protein language models, Cell Sys., 14:P968-P978.
C Nikolaou (2014), Menzerath-Altmann law in mammalian exons reflects the dynamics of gene structure evolution, Comp. Biol. Chem., 53(A):134-143.
D Ofer, N Brandes, M Linial (2021), The language of proteins: NLP, machine learning & protein sequences, Comp. Struct. Biotech. J., 19:1750-1758.
T Paysan-Lafosse, M Blum, S Chuguransky, T Grego, BL Pinto, GA Salazar, ML Bileschi, P Bork, A Bridge, L Colwell, J Gough, DH Haft, I Letunić, A Marchler-Bauer, H Mi, DA Natale, CA Orengo, AP Pandurangan, C Rivoire, CJA Sigrist, I Sillitoe, N Thanki, PD Thomas, SCE Tosatto, CH Wu, A Bateman (2022), InterPro in 2022, Nucl. Acids Res., 51:D418-D427.
AM Petersen, JN Tenenbaum, S Havlin, HE Stanley, M Perc (2012), Languages cool as they expand: Allometric scaling and the decreasing need for new words, Sci. Rep., 2:943.
GJ Phillips, J Arnold, R Ivarie (1987), The effect of codon usage on the oligonucleotide composition of the E. coli genome and identification of over- and underrepresented sequences by Markov chain analysis, Nucl. Acids Res., 15:2627–2638.
J Qian, NM Luscombe, M Gerstein (2001), Protein family and fold occurrence in genomes: power-law behaviour and evolutionary model, J. Mol. Biol., 313:673-681.
A Radford, J Wu, R Child, D Luan, D Amodei, I Sutskever (2019), Language models are unsupervised multitask learners, preprint, https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
A Rahman, I Hallgrimsdottir, M Eisen, L Pachter (2018), Association mapping from sequencing reads using k-mers, eLife, 7:e32920.
R Rao, J Meier, T Sercu, S Ovchinnikov, A Rives (2020), Transformer protein language models are unsupervised structure learners, BioRxiv preprint, doi:10.1101/2020.12.15.422761
A Scaiewicz and M Levitt (2015), The language of the protein universe, Curr. Opin. Genet. Dev., 35:50-56.
DB Searls (2002), The language of genes, Nature, 420:211-217.
S Semple, R Ferrer-i-Cancho, ML Gustison (2022), Linguistic laws in biology, Trends in Eco. Evo., 37:53-66.
M Sheinman, A Ramisch, F Massip, PF Arndt (2016), Evolutionary dynamics of selfish DNA explains the abundance distribution of genomic subsequences, Sci. Rep., 6:30851.
ZD Stephens and RK Iyer (2018), Measuring the mappability spectrum of reference genome assemblies, in BCB’18: Proc. 2018 ACM Int. Conf. on Bioinformatics, Comp. Biol. and Health Informatics (ACM, New York, NY, USA). URL: https://doi.org/10.1145/3233547.3233582
H Tettelin, D Riley, C Cattuto, D Medini (2008), Comparative genomics: the bacterial pan-genome, Curr. Opin. Microbiol., 11:472-477.
The UniProt Consortium (2012), Reorganizing the protein space at the Universal Protein Resource (UniProt), Nucl. Acids Res., 40:D71-D75.
M Tunnicliffe and G Hunter (2022), Random sampling of the Zipf–Mandelbrot distribution as a representation of vocabulary growth, Physica A, 608:128259.
DC van Leijenhorst and TP van der Weide (2005), A formal derivation of Heaps’ Law, Info. Sci., 170:263-272.
J Vilo (2002), Pattern Discovery from Biosequences, Ph.D Thesis (Department of Computer Science, University of Helsinki).
A Wagner (2019), Life Finds a Way: Mapping the Origins of Creativity (Basic Books, New York, NY, USA).
Y Wang, H Zhang, H Zhong, Z Xue (2021), Protein domain identification methods and online resources, Comp. Struc. Biotech. J., 19:1145-1153.
JJ Webster and C Kit (1992), Tokenization as the initial phase in NLP, in Proc. 14th Conf. Comp. Linguistics, vol.4, pp.1107-1110 (Asso. Comp. Linguistics).
L Wetzel (2009), Types and Tokens: On Abstract Objects (MIT Press, Cambridge, MA, USA).
L Yu, DK Tanwar, ED Penha, YI Wolf, EV Koonin, MK Basu (2019), Grammar of protein domain architectures, Proc. Natl. Acad. Sci., 116:3636-3645.
A Zielezinski, S Vinga, J Almeida, WM Karlowski (2017), Alignment-free sequence comparison: benefits, applications, and tools, Genome Biol., 18:186.
GK Zipf (1935), The Psycho-Biology of Language (Houghton-Mifflin, Boston, MA).