The business impact of this growing complexity is stark: multicore and multiprocessor software projects are 4.5X more expensive, have 25% longer schedules, and require almost 3X as many software engineers.1 One area where this growing complexity can have a dramatic impact on cost and schedule overruns is software testing and code inspection. A multicore/multiprocessor environment can add exponential complexity to effectively identifying errors in software. Two classes of problem in particular have the ability to drag the productivity of a software team through the floor: concurrency errors and endian incompatibilities. This whitepaper discusses these issues in detail, explains how Klocwork's source code analysis engine, Klocwork Truepath, can be used to address them, and walks through two examples of these problems in prominent open source projects.
Figure 1 | Processing Architecture Used in the Current Project and Expected in Next Two Years (Percent of Respondents)
Current project: single processor 61.8%; multiprocessor 20.8%; multicore 9.3%; multicore and multiprocessor 5.2%; don't know 2.9%. Expected in two years: multicore and multiprocessor 19.4%; don't know 8.5%.
VDC Research, Next Generation Embedded Hardware Architectures: Driving Onset of Project Delays, Costs
Figure 2 | Klocwork Truepath tool chain provides the concurrency analysis engine after build emulation and control flow graph analysis.
Compile: emulate native build; build control flow graph. Symbolic logic: analyze control flow graph; perform dataflow analysis. Concurrency: analyze lock dependencies.
In this figure you can see that data relating to lock lifecycles is gathered by the normal analysis engine. Once this has been produced for all modules in the system, the whole program space is then analyzed by the new concurrency analysis engine to find loops in the lifecycle graph, which equate to deadlocks. Consider a function that operates as follows:
lock_t Lock1, Lock2;

void foo(int x)
{
    if( x & 1 ) {
        lock(Lock1);
        lock(Lock2);
    }
    else
        lock(Lock1);
}
You can easily see by inspection that when passed an odd number as its parameter, this function defines a dependency of Lock2 upon Lock1. Given an even parameter, Lock1 is still reserved, but this time there is no dependency of Lock2 upon Lock1 at the local scope, although that dependency (or another) may still exist at an inter-procedural scope. Therefore, we have two discrete types of questions to ask when performing the analysis:

1. Symbolic logic questions:
   a. Is there a valid control flow that gets us to call function foo() with an odd parameter?
   b. Is there a valid control flow that results in foo() being called with an even parameter, followed by a call to another function that results in another lock (e.g. Lock2) being reserved before Lock1 is released?

2. Lock dependency questions:
   a. If either of these is so, is there any other situation in the program's natural control flow whereby a counter-dependency of Lock1 upon Lock2 can be reached, potentially resulting in a deadlock?

The first type of question is answered by Klocwork Truepath's symbolic logic engine during the normal course of program analysis, just as any other type of defect is analyzed for inter-procedural data flows that can or cannot occur. The second type of question is then answered by the concurrency analysis engine, fed by the collection of all possible dependencies within the program space. The result tends to be a small set of deadlock scenarios that are incredibly difficult to find manually and insanely difficult to understand without a tool, but that developers can triage and fix very quickly within the natural course of their implementation tasks.
Endian Incompatibilities
Whilst it may be true that there are 10 kinds of people in the world, a switch from a little endian platform to a big endian platform will muddy that impression considerably. An advisor of ours recently informed me with glee that he'd finally set his MSB (having passed his 64th birthday), but store that in nibble representation on an unexpected endian architecture and he'd be regressing to the nursery once more. In short, endian representations affect how the host processor stores integral types in memory. Considering 32-bit integers, each of which consists of four bytes of memory, the processor can choose to read and write those four bytes in a variety of orders, although traditionally only two are used:

- Little endian, in which the bytes are written in the order 0, 1, 2, 3
- Big endian, in which the bytes are written in the order 3, 2, 1, 0
This picture becomes slightly muddied if the processor actually writes words at a time (this is mostly a historical representation now, but we mention it for completeness) and applies its endian assumptions to each word:

- Little endian still writes bytes in the order 0, 1, 2, 3
- Big endian, however, may now write bytes in the order 1, 0, 3, 2
However the processor stores and reads such types is entirely at its own discretion and the business of nobody else. Until, that is, the developer directs the processor to write such data into a medium for transmission, as opposed to storage in memory. Transmission media, which could be sockets, files, pipes, or any other interprocessor vector (e.g. interrupts that cause data to be written to the PCI-Express interface, or to the serial bus, or ...), are addressed by the processor in exactly the same way as memory unless specifically told to do otherwise.
Thus, a big endian processor will write a 32-bit integer onto a socket in byte order 3, 2, 1, 0. If the CPU on the other end of the socket uses a little endian architecture, then obviously a value written onto the socket will be interpreted completely differently when read. For example, a 16-bit value of 209 (0x00D1), written by a big endian processor and read by a little endian processor, will be interpreted as 53,504 (0xD100), which is not a small correction by any means. Preparing a program for use with heterogeneous processor architectures therefore involves finding every integral type that ever hits a transmission vector that could legitimately target another processor, and ensuring that the read/write operation involved transforms the data into/from a neutral representation that both sides agree on. In a program of any size at all, this is obviously a non-trivial task. Klocwork Truepath can help developers in this task as it now includes the ability to validate type representation usage symmetrically as those types cross transmission vector boundaries. That is, the data flow engine within Klocwork Truepath automatically validates that types written directly to a transmission vector are subject to host-to-neutral format transformation before the write operation takes place. Likewise, integral types read from a transmission vector are tracked to ensure that they are appropriately transformed prior to the first attempted usage on the host. For example, consider the following function:
void foo(int sock)
{
    int x;

    for( x = 0; x < 256; x++ )
        if( send(sock, &x, sizeof(int), 0) < (ssize_t)sizeof(int) )
            return;
}
This simple function makes the basic assumption that the reader on the other end of its socket has the same processor architecture as the sender. This might be true, or more accurately it might be true today, but what designer can ever look far enough into the future to know that it will always be true, regardless of market shifts, great ideas that marketing interns have, and so on? Klocwork Truepath, upon analysis of this function, will point out:

Value 'x' is used in host byte order, but should be used in environment/network byte order.

A developer versed in inter-architectural development will naturally modify this function to transform the value of the variable x prior to transmission:
void foo(int sock)
{
    int x, xt;

    for( x = 0; x < 256; x++ ) {
        xt = htonl(x); // or some other suitable transform
        if( send(sock, &xt, sizeof(int), 0) < (ssize_t)sizeof(int) )
            return;
    }
}
Likewise when it comes to reading information across a transmission vector, Klocwork Truepath traces the data flow of any received integral types to ensure, in exactly the opposite way to sending, that any such values are transformed to host format prior to their first usage.
Now I can call enter() multiple times, simulating some of the capabilities of a true recursive lock, and as long as I remember to call leave() an equal number of times, the lifecycle of the underlying non-recursive lock is managed correctly:
void foo()
{
    // real lock is reserved
    enter();

    if( i-really-want-to ) {
        // only the reference count is affected
        enter();
        leave();
    }

    // now the real lock is released
    leave();
}
Now consider the requirement to implement an abstraction over thread-specific data storage. To ensure safety when allocating such a structure, the database engine uses the singleton recursive lock described above to protect its activities with an implementation that simplifies as follows:
int tlsCreated = 0;

data_t* create_data()
{
    static data_t* tls;

    enter();
    if( tlsCreated == 0 )
        tls = create_thread_data();
    tlsCreated = 1;
    leave();

    init_data(tls);
    return tls;
}
To simple inspection, this appears quite correct: it calls leave() the same number of times as enter() and thus should be considered well behaved. Unfortunately, life in the parallel world is rarely simple to analyze, and this case is certainly more complicated than it first appears. Consider a two core CPU executing two threads, both calling create_data() at very slight offsets in time. The first thread (let's call our threads Thread 1 and Thread 2) begins executing create_data() and successfully calls the enter() function. This results in the underlying lock, lock 2, being reserved to Thread 1:
Thread 1
  create_data()
    enter()
      refCount == 0
      reserve(lock1)
      reserve(lock2)
      release(lock1)
      refCount = 1
Now let's assume that Thread 2 begins its execution of create_data() during the time that Thread 1 is active, and before it releases lock 1:
Thread 1                      Thread 2
  create_data()
    enter()
      refCount == 0
      reserve(lock1)
      reserve(lock2)          create_data()
      release(lock1)            enter()
One further assumption makes the scenario whole: Thread 1 is at this moment interrupted by the operating system, losing its time on chip. Crucially, this happens before the reference count is updated. (Check the implementation of enter() and you'll see that the author unfortunately left the reference count update outside of the lock that is supposed to guard access to it.) As the reference count will therefore still read zero for Thread 2, it will attempt to reserve lock 2, resulting in Thread 2 blocking (as lock 2 is already owned by Thread 1):
Thread 2
  create_data()
    enter()
      refCount == 0
      reserve(lock1)
      reserve(lock2)  (blocked: lock 2 is owned by Thread 1)
Upon return from interrupt, Thread 1 resumes execution where it left off, incrementing the reference count and returning from the enter() function. Its execution of create_data() continues, leading to a call to the leave() function, which unfortunately attempts to reserve lock 1 before doing anything else:
Thread 1
  create_data()
    enter()
      refCount == 0
      reserve(lock1)
      reserve(lock2)
      release(lock1)
      (interrupted)
      refCount = 1
      return
    leave()
      reserve(lock1)  (blocked)

Thread 2
  (blocked on lock 2, holding lock 1)
Because Thread 2 is currently blocked waiting on lock 2, and currently owns lock 1, Thread 1 will now block on its own attempt to reserve lock 1. In short, this is a classic lock-order inversion: contention caused by a poorly guarded data item which, when subject to a race condition (being read by one thread whilst in the process of being updated by another), causes one thread to reserve locks in order while the other thread attempts to reserve them out of order, resulting in a deadlock. With the race condition fixed, this singleton will operate correctly, although as previously described the author actually chose to completely rewrite this module, providing a more useful re-entrant mutual exclusion capability for multiple threads, i.e. removing the singleton semantic.
In this example, it's simple to see the assumption in all its glory: the data member msg.msg_hdr.m_size is read and used directly off the wire, in what could be, but isn't in this case, network order. Now let's assume that a new generation of designers revisits this decision and instead places emphasis on scale and flexibility over ease of implementation, deciding to place the statistics collector process on an arbitrary node in the hardware design, rather than on the same node as the kernel process. With this decision in place, the assumption that network byte order and host byte order are the same can no longer be made in general. Porting to this new assumption set could take significant time, both for developers and for the test crew, faced with putting together a matrix of CPUs/hosts that embody the plethora of representations we can expect to support in the field. Using a tool-driven approach, however, this entire effort can be collapsed to a single analysis pass, taking minutes in total, to see a report of what's involved. In this case, the designers would be faced with the following endian vulnerabilities that would need to be addressed (along with the obvious logistical issues around how to place the process on the right host/CPU, of course):

pgstats.c: line 1988: function pgstat_recvbuffer()
  Value 'msg.msg_hdr.m_size' is used in network order.
pgstats.c: line 1443: function pgstat_send()
  Value '*msg' is used in host byte order.
These two simple issues might be thought of as the whole problem domain. However, looking further into what this module is capable of, certain information can be persisted across sessions using a statistics file. If we further our decision to allow the process to be spawned on heterogeneous hardware, we might well continue that spread by allowing different instantiations of said process to occur on heterogeneous hardware, thus requiring persistent data to be endian safe:

pgstats.c: line 2556: function pgstat_read_statsfile()
  Value 'format_id' is used in environment byte order. Similar errors can be found on line(s): 2610, 2684, 2717, 2740.
pgstats.c: line 2312: function pgstat_write_statsfile()
  Value 'format_id' is used in host byte order. Similar errors can be found on line(s): 2351, 2384, 2411, 2412.
Armed with this information, the designer can make all required updates to remove endian vulnerability from their code in one pass.
Conclusion
The complexity of this problem domain is vast, so there's no one solution, tool, or approach that will address all your problems. Development teams need to equip themselves with good tools, smart design assumptions, and even smarter developers to reconcile the feature race being demanded by the market with the underlying platform complexity that implies. When it comes to selecting a tool, source code analysis should be on your shortlist, as it offers a compelling mix of scalability, flexibility, and the ability to address a broad set of issues that will help you ensure the overall quality and security of your code.
About Klocwork
Klocwork offers a portfolio of software development productivity tools designed to ensure the security, quality and maintainability of complex code bases. Using proven static analysis technology, Klocwork's tools identify critical security vulnerabilities and quality defects, optimize peer code review, and help developers create more maintainable code. Klocwork's tools are an integral part of the development process for over 700 customers in the consumer electronics, mobile devices, medical technologies, telecom, military and aerospace sectors.