The study group topic for discussion this week was on formal methods of modeling malware and viruses. In the late 80's Fred Cohen, at the time a student of Leonard Adleman, described an abstract definition of viruses and used it to prove that there is no algorithm that can perfectly detect all possible viruses (text). Adleman himself published a paper a year later, using an abstract definition of a computer virus to answer questions on the detection of, and protection against viruses (text). Since the late 80's the capability of malware designers has increased and the complexity of malware itself has evolved, resulting in a renewed interest in abstract definitions and modeling of malware and viruses. In the study group Theo introduced some more recent research that follows on from Cohen and Adleman's earlier work; "Toward an Abstract Computer Virology" by G. Bonfante, M. Kaczmarek, and J.-Y. Marion ((pdf) and "Math on Malware" by Henk-Jan van der Molen (pdf).
Bonfante et al. suggest a newer definition for viruses that encapsulates Adleman's previous definitions, amongst others, and use the new definition to make an interesting observation on virus detection and to propose a countermeasure against a particular type of virus propagation. A virus is a program with a propagation mechanism that can be described in terms of some computable function B, that selects from a list of available programs and then replicates the virus inside the target. In other words, B is the delivery mechanism; some functional property or flaw in the programming language used by the virus to infect the system. A common approach from previous attempts at formalising viruses has been to define a virus as a self-replicating device; a virus has the capability to act on a description of itself. This leads to a definition that uses Kleene's recursion theorem to construct the viral mechanism. The self-reproduction is provided by a duplication function that copies a virus into a program in different ways.
There are three particular types of duplication function to mimic three types of virus behaviour. A crushing is simply a virus that has some behaviour involving copying itself, e.g. to a directory. A cloning is a virus that copies itself into some parts of a program environment, without increasing the overall program size (typically by overwriting honest code). Finally an ecto-symbiotic virus essentially attaches itself to a host program, without overwriting. Further definitions that cover polymorphic viruses (viruses that in some sense mutate when they reproduce) are discussed in the paper. The authors also show how these definitions can encapsulate Adleman's original definitions by translating it into the new formalism. An interesting result is that it can be proven, under the definitions that are approximately described above, that there are at least some propagation mechanisms B for which it is decidable whether a particular program is a virus.
A slightly different question is to ask how malware or viruses propagate throughout a network once they are introduced into the environment. Henk-Han van der Molen focuses on using ideas from epidemiology and network theory to study how infection rates change over time once a network is compromised. In particular, a highly useful metric would be to quantify how impactful particular countermeasures and security methods are at reducing infection rates (and increasing clean-up rates). For instance, potential tools for alleviation such as deployment of anti-virus software, periodic re-imaging, increased user education, and incident management/disaster recovery procedures have varying economic costs, so knowing their real effectiveness as countermeasures can increase efficiency.
The particular network model chosen for analysis is described as the "Susceptible-Infected-Susceptible" (SIS) model. From a biological standpoint, it captures a scenario whereby after contracting a disease, and recovery, humans are still susceptible to re-acquiring the condition later on. This is perhaps the best-fit for modeling a computer network, as it may be that machines that are cleaned/re-imaged/patched still have potential vulnerabilities. Without going into the mathematical description too heavily, the SIS model starts by assuming that in the early phase, infection rates grow slowly because there are few infected computers to spread the malware. The number of infected computers that are returned to the susceptible state is proportional to the number of infected computers. Two other additional factors are the probability of resuscitation, which models the effectiveness of the countermeasures in cleaning an infected node back to its susceptible state, and the contamination factor, which models the probability an infected node can infect a susceptible neighbour.
The SIS model is analysed in three particular scenarios. The first is the 'corporate' scenario, with statistics from malware infections in organisations used to estimate the resuscitation and contamination factors. In the 'practice' scenario, the goal of the malware is assumed to be to infect as many computers as quickly as possible. In this instance, the parameter estimation is difficult because there are few sources of reliable statistics available. In the final 'cyberwarfare' scenario, the malware needs to remain undetected for as long as possible to keep target nodes in the infected state. One interesting result drawn from the model is that with a periodic reset of software, in the corporate scenario the steady state of malware infections drops almost to zero. With the increasing proliferation of virtualisation techniques in business, executing this strategy frequently may become more and more viable.
Another mitigating factor in the spread of malware is the diversity of the software installed on a network. A fully Windows-based network could potentially be entirely compromised by a single Windows-based virus, but a network comprised of a mix of Mac OS, Linux and Windows could not be infected to the same extent by just one such virus. An analysis of the SIS model seems to support this hypothesis, with increasing diversity resulting in slower infection rates. However the practical and economic implications of this strategy are much harder to capture: the support costs for supporting diverse software are higher, typically vendors try to encourage lock-in, a reliance on software supporting shared data standards is introduced, and so on.
The research poses a challenging yet important problem. Through modeling it will be possible to improve our understanding of virus and malware propagation and the effectiveness of tools and procedures as countermeasures, but a further understanding of the implications and costs of the mitigating factors is also required to have a complete solution.