Universal distribution of component frequencies in biological and technological systems

Bacterial genomes and large-scale computer software projects both consist of a large number of components (genes or software packages) connected via a network of mutual dependencies. Components can be easily added to or removed from individual systems, and their frequencies of use vary over many orders of magnitude. We study this frequency distribution in the genomes of ∼500 bacterial species and in over 2 million Linux computers and find that in both cases it is described by the same scale-free power-law distribution, with an additional peak near the tail of the distribution corresponding to nearly universal components. We argue that a power-law distribution of component frequencies is a general property of any modular system with a multilayered dependency network. We demonstrate that the frequency of a component is positively correlated with its dependency degree, defined as the total number of upstream components whose operation directly or indirectly depends on the selected component. The observed distributions of frequency and dependency degree are reproduced in a simple, mathematically tractable model introduced and analyzed in this study.
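
As an illustration of the two quantities compared above, the following minimal sketch computes the dependency degree of each component (the number of components that directly or indirectly depend on it) and the frequency of each component across a set of systems. The package names, the toy dependency table, and the function names `dependency_degree` and `frequencies` are hypothetical and chosen only for illustration; this is not the authors' code or data.

```python
from collections import defaultdict


def dependency_degree(depends_on):
    """Map each component to the number of components that directly
    or indirectly depend on it (its transitive dependents)."""
    # Reverse the edges: dependents[x] holds components that directly need x.
    dependents = defaultdict(set)
    components = set(depends_on)
    for comp, deps in depends_on.items():
        for dep in deps:
            dependents[dep].add(comp)
            components.add(dep)

    degree = {}
    for comp in components:
        seen = set()
        stack = [comp]
        # Depth-first walk over reversed edges collects all dependents.
        while stack:
            node = stack.pop()
            for parent in dependents.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        seen.discard(comp)  # ignore self if the input contains a cycle
        degree[comp] = len(seen)
    return degree


def frequencies(systems):
    """Fraction of systems (genomes or computers) containing each component."""
    counts = defaultdict(int)
    for installed in systems:
        for comp in set(installed):
            counts[comp] += 1
    return {comp: n / len(systems) for comp, n in counts.items()}


# Hypothetical toy data: component -> components it directly depends on.
toy_deps = {
    "libc": [],
    "ssl": ["libc"],
    "curl": ["ssl", "libc"],
    "git": ["curl"],
}
# Hypothetical installations; each contains all dependencies of its components.
toy_systems = [
    {"libc", "ssl", "curl", "git"},
    {"libc", "ssl"},
    {"libc"},
]

degree = dependency_degree(toy_deps)
freq = frequencies(toy_systems)
for comp in sorted(degree, key=degree.get, reverse=True):
    print(comp, degree[comp], round(freq[comp], 2))
```

In this toy example the component with the most dependents is also the one present in every system, which is the kind of positive correlation the abstract describes; real data would of course require far more systems to say anything about the shape of the distribution.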
