



# Escola Tècnica Superior d'Enginyeria Informàtica Universitat Politècnica de València

### PINT, A SIMULATION TOOL BASED ON PIN TRACES

FINAL YEAR PROJECT

Computer engineering

Author: Francisco Blas Izquierdo Riera

Director: Julio Sahuquillo Borrás - UPV

Per Stenström - CTH

December 18, 2012

#### Abstract

In the course of this project we have developed a set of programs to improve the correctness and execution time of the gem5 simulator.

To this end, we moved the functional simulation step out of gem5 into an independent instrumented process. This ensures correctness in the functional stage and provides good execution speed, since the code is then executed natively. The instrumentation is done by Pin.

Also, to allow efficient communication between the processes despite the limitations Pin imposes on the tools available, we developed an IPC framework for message passing between processes. This framework uses lockless FIFO queues over shared memory, so the resulting slowdown is minimal.

# Contents

| Section | Title |
|---------|-------|
| 1 | Introduction |
| 1.1 | Project rationale |
| 1.2 | Project objectives |
| 1.3 | Project strengths |
| 1.4 | Memory structure |
| 2 | State of the art |
| 2.1 | Architectural simulators |
| 2.1.1 | Graphite |
| 2.1.2 | Multi2Sim |
| 2.1.3 | gem5 |
| 2.2 | Instrumentation systems |
| 2.2.1 | gprof |
| 2.2.2 | Pin |
| 2.3 | Benchmarks |
| 2.3.1 | SPEC CPU2006 |
| 2.3.2 | SPLASH-2 |
| 3 | System description |
| 3.1 | Mead: a message passing framework |
| 3.2 | Pint: a Pin based trace generator |
| 3.3 | Schnapps: a simple consumer of the traces |
| 3.4 | Gin5: a gem5 trace player |
| 4 | System design |
| 4.1 | Mead: a message passing framework |
| 4.2 | Pint: a Pin based trace generator |
| 4.3 | Schnapps: a simple consumer of the traces |
| 4.4 | Gin5: a gem5 trace player |
| 5 | Results |
| 6 | Conclusions |
| 6.1 | Improvements for next release |
| A | User manual |
| A.1 | Building |
| A.2 | Pint |
| A.3 | Schnapps |
| A.4 | Gin5 |
| B | Relevant source code |
|   | Bibliography |

# Chapter 1

### Introduction

### 1.1 Project rationale

Despite the vast number of hardware simulators that exist nowadays, most of them either lack flexibility in the simulations or are slow, since they simulate the code execution instead of instrumenting the natively executed program. As a related issue, since code is simulated and not executed, it is common to find bugs where the simulator does not set the processor state properly. These bugs surface in corner cases where the simulator deviates from the real processor and some programs fail to execute correctly.

Also, many simulators lack support for parallel execution, and those that do support it tend to add big overheads when running the simulation on a single machine and do not support instruction level simulation granularity.

Finally, current simulators tend to add big overheads to the functional simulation step, which makes it infeasible to run large tests even when simulating simple systems.

Program instrumentation solves these shortcomings by running the code natively, modified so that it also executes the instrumentation code. This allows it to run in parallel and, since the code is executed by the real processor, guarantees that it behaves exactly as it would in a native run.

Given the limitations of current simulators, we consider that the community needs a flexible and fast instrumentation based tracer that can be used with a broad range of programming languages and that handles local simulations in parallel, with small overheads and with instruction level granularity.

### 1.2 Project objectives

Our main objective is to provide an instrumentation based tracer that can be used with other simulators. Given how large these traces can become, we feed them to the simulator live instead of storing them.

To see whether these objectives are met, we will measure the slowdown with respect to the non instrumented program using a simple trace consumer (to ensure the consumer is not the bottleneck). Our objective is to get slowdowns at least similar to those of Graphite [9], while removing its caveats, at least on single core processors.

We also intend to design an architecture that can later be expanded to support multiple simultaneous execution threads. Due to time constraints, this will be done in a later version of the project.

### 1.3 Project strengths

The biggest problem with instrumentation-based systems is that the instrumentation code is heavily limited by the API of the instrumentation system (for example, the POSIX thread API cannot be used with Pin). This also vastly reduces the number of languages that can be used. To overcome these limitations, we use FIFO queues placed in shared memory to move the data from one process to another, relying on the operating system's process separation to run the simulation on a different processor. As a result, the simulator is free of the restrictions imposed by the instrumentation framework, since these apply only to the process in which the program is being instrumented.

As a side effect of this approach, the resulting simulators are split into pipeline stages, since the functional simulation can run on a different processing unit than the one running the simulation itself. As the number of processor cores increases, it is likely that hardware simulators will use this pipelined approach more extensively to increase performance. Also, when the cache accesses caused by the shared memory communication cause slowdowns, hyperthreading processors can be used and processor affinity can be set so that the critical simulation stages (i.e. those responsible for bottlenecks) run alongside the preceding stage on different hardware threads of a single core. The reads from the FIFO queue are then likely to hit in the level 1 cache.

To ensure real parallelism each thread of the instrumented program can use a different FIFO queue to extract its traces<sup>1</sup>. This also allows the user to limit instruction granularity by setting an appropriate queue size since the instrumentation will stop execution once the queue is full<sup>2</sup>.

As an example of how Pint can be used, we also created Gin5, a slightly modified version of the gem5 simulator which uses Pint's instrumentation as the source of memory access information during the simulation.

Another known problem is that simulators tend to be very good at simulating a specific part of the system whilst having issues with others. We consider that in the future this FIFO system may be useful to interconnect simulators so that the best of each can be combined.

<sup>1</sup> The completely multithreaded implementation will be finished with the next release of Pint.

<sup>2</sup> For this, a simulation started event is added to ensure the first instruction will not be executed until the simulator wants it.

### 1.4 Memory structure

In this introduction we presented the problem we are trying to solve, our objectives and our strengths.

In the next chapter, we explain the state of the art at the time of writing in the areas of architectural simulators, instrumentation systems and benchmarks.

Afterwards we describe the four modules we developed for the project, followed by the design decisions behind them.

Finally, we present our benchmarking results and our conclusions.

The appendices contain a brief user manual, in case you want to try our system, and the bibliography.

The Annex folder contains the sources we developed in this project.

# Chapter 2

### State of the art

Of the many simulators currently available, we have chosen three to illustrate the current state of the art, since they are the ones on which most work is being done nowadays: gem5, Multi2Sim and Graphite.

As instrumentation tools we will cover gprof based profiling and Pin.

Finally, as benchmarks we will cover the SPEC CPU2006 and the SPLASH-2 benchmarks.

### 2.1 Architectural simulators

Architectural simulators are tools used to see how a proposed processor design would behave without the need to build the processor itself. They share some similarities with virtual machines, in that both execute programs and both care about the correct execution of the program. However, virtual machines focus on executing the program quickly, whilst architectural simulators focus on providing good statistics of the program execution and on executing the program in the same way the simulated architecture would.

Architectural simulators tend to be structured in three stages: disassembly, functional simulation and cycle by cycle simulation.

During the disassembly stage the machine code to be executed is transformed into a set of structures that the simulator can understand. The choice of structures is critical for an efficient simulation.

During the functional simulation the resulting set of structures is interpreted by the simulator to modify the internal state of the processor structures and the representation of the simulated program's memory space.

Finally, during the cycle by cycle simulation the represented architecture and system are simulated on a cycle by cycle basis so that the timing results are precise.

Simulators may or may not have these stages clearly differentiated, but all of them have them.

Also, some simulators that model only the memory system (and possibly a few other processor structures) are based on memory traces. A memory trace is a description of the memory accesses made by a particular program during a run, which is then replayed on the simulated memory system.

In general, trace based simulators tend to be fast, since they not only remove the execution step but also use a simpler model of the processor. Traces have a few problems, though. On one hand, the programs must be run in a way that generates the trace, for example under dynamic instrumentation. On the other hand, large traces take a lot of space: for example, a trace of the SPLASH-2 LU benchmark with contiguous blocks would take around 1.5 GiB even if each access could be stored in only 32 bits.

Nevertheless, there is some good work on trace generation with Pin for simulators like Dinero IV [7], an example of which is dinerotool [1] by Kenneth Barr.

When using traces it is hard to avoid the need to generate them, but it is possible to overcome the space limitation by feeding them live into the memory simulator. This is the approach we chose.
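As a rough check of the size quoted above, the following sketch relates a trace's size to its number of accesses under the fixed 32 bits per access assumed in the text. The helper name and constants are hypothetical, and the access count is derived from the 1.5 GiB figure rather than measured.

```cpp
#include <cstdint>

// Size in bytes of a trace holding `accesses` records of `bits_per_access`
// bits each. Hypothetical helper, for illustration only.
constexpr std::uint64_t trace_bytes(std::uint64_t accesses,
                                    std::uint64_t bits_per_access) {
    return accesses * (bits_per_access / 8);
}

// 1.5 GiB expressed in bytes (3 * 512 MiB).
constexpr std::uint64_t kOneAndHalfGiB = 3ULL * 512 * 1024 * 1024;

// Number of 32-bit records that fit in 1.5 GiB: about 403 million accesses.
constexpr std::uint64_t kAccesses = kOneAndHalfGiB / 4;
```

In other words, a stored trace of that size corresponds to roughly four hundred million accesses, which is why feeding the trace live is attractive.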

#### 2.1.1 Graphite

Graphite [18] [9] is a multicore simulator, also built on Pin, designed to provide real multithreading both when run locally and when run over a large number of computers. To do this, Graphite hijacks some of the syscalls, which are then sent to the local kernel, to the central kernel or to both. A similar procedure is used to track memory accesses, and an internal "MMU" tracks which machine holds which copy of the memory.

To synchronize threads, Graphite provides several synchronization methods, of which the fastest is lax synchronization.

Given its current popularity, Graphite was chosen as the reference against which we compare the speed of our system.

Sadly, one of the major caveats of Graphite is that it is very system specific and, as a result, it was impossible for us to run it on our testing equipment.

#### 2.1.2 Multi2Sim

Multi2Sim [20] [3] is a simulator supporting a large set of targets in order to emulate different architectures, both CPU and GPU.

It is split into different components: a disassembler, which converts the input programs into something the simulator can understand and use; a functional simulator, which maintains the CPU and memory state and runs the code; and a cycle by cycle simulator, which carries out the detailed execution. It also provides some visual tools for inspecting how the simulation runs.

#### 2.1.3 gem5

gem5 [16] [2] is the result of the merge of two powerful simulators: M5 and GEMS. It can emulate several architectures both in Full System mode (that is, running the kernel as part of the simulation) and in Syscall Emulation mode (where, as in the two aforementioned simulators, the kernel is emulated for the provided binary).

As a full system simulator it is known for its flexibility in emulating different systems, not only in the number of architectures it supports but also in the number of devices it can emulate.

Its main problem is that, although the modules are written in C++, they are usually driven by a Python script, which complicates the system.

This flexibility was the reason for choosing gem5 as the target for implementing our system.

### 2.2 Instrumentation systems

Instrumentation systems provide ways to observe how code is running, either for later statistics generation and performance checking or for other uses like memory trace generation.

Instrumentation can be dynamic, if the code that monitors the program is added when the program is executed, or static, if this code is interleaved by the compiler when building the program. Normally dynamic instrumentation is preferable, since it also allows us to instrument proprietary programs and does not require a modified compiler.

#### 2.2.1 gprof

gprof [17] [6] is a profiling system used with programs compiled by gcc [14] [13] with special flags. gcc embeds the profiling code and mixes it with the compiled sources before assembly. This technique is called static instrumentation, since it is done at compile time.

Programs compiled with profiling flags generate, when run, a binary file called gmon.out containing the execution statistics. Afterwards, gprof can be invoked to interpret the generated file.

Although traces could also be generated with these techniques, the requirement of compiling the programs with a particular compiler is an impediment in some cases, and thus the ideas provided by this system were discarded.

#### 2.2.2 Pin

Pin [21] [10] [8], on the other hand, is a dynamic instrumentation framework, which means that the instrumentation code is added at run time. To do so, Pin hijacks the program to be run with ptrace, as a debugger like gdb [15] [12] would do. It then loads the Pintool's code and the Pin framework into the running program and modifies the process so that it runs the code produced by Pin's JIT generator.

For this to work, Pin provides a modified version of the C++ runtime with some features stripped out in order to prevent incompatibilities with the program being run. Most C++ features can still be used by the tools, and for those that cannot, Pin provides alternatives (for example, locks).

The main problem with Pin is that running it on hardened systems is complicated. The default method Pin uses to attach to the program via ptrace is considered dangerous by these kernels, since it is not a parent attaching to its child but the other way around. In addition, Pin's JIT compiler causes problems because it tries to create mappings that are both writable and executable, another technique restricted by hardened systems.

Despite these issues Pin was the system chosen for providing the instrumentation framework.

### 2.3 Benchmarks

Benchmarks are programs with standardized inputs that are used to measure and compare the performance of different systems running them. Depending on the component being measured, different metrics can be used: power consumption, execution time, number of frames per second generated, etc. Of these, execution time is the one we care about most in this project.

Benchmarks can be synthetic, when they emulate the load caused by typical programs of a particular type (examples are Dhrystone [25] and Whetstone [5]), or application based, when they run one or more real world programs, like the two we have analyzed. In general, real world benchmarks provide more meaningful results, since they show how real applications will behave.

#### 2.3.1 SPEC CPU2006

The SPEC CPU2006 [22] [11] benchmark is a set of real world programs provided along with some inputs to test the speed of a system, with a main focus on CPU execution speed. Despite dating from 2006, these benchmarks are widely used and understood in both academia and industry.

Most of the programs provided with the benchmark are released under GPL style licenses and are well known in the free software world, for example gcc or perl, whilst others come from different research projects. Because of this, the copyright in these benchmarks is held over the input files.

The main problem with these benchmarks is that they focus on single threaded processes.

#### 2.3.2 SPLASH-2

The SPLASH-2 [23] [26] benchmark was developed by the Flash research group at Stanford University to provide a set of benchmarks that could be used on shared memory multiprocessor systems. Although the benchmarks are quite old and require modifications to work properly, they can still be used and have the advantage of running in a short time.

The applications provided are related to the scientific world with examples of 3 body gravity simulators or some kernels like the LU decomposition of a matrix.

Since the original tests will not run, we used a modified version of the SPLASH-2 benchmark [19]. Further modifications were required for the null macro to work properly and for the tests to run with Pin on hardened systems. These modifications are provided as a patch file in the source distribution.

The main reason for choosing these benchmarks was that their relative performance results (although with more than one processor) were provided in [9], so we did not need to run the benchmarks again for Graphite and thus set up the required Debian environment.

# Chapter 3

# System description

Our application is divided into four modules: Mead, a framework providing an efficient message passing interface between different processes; Pint, a Pin based trace generator; Schnapps, a simple consumer of the traces; and Gin5, a gem5 trace player for the memory system.

The traces are generated by Pint and then fed through Mead to either Schnapps or Gin5, which process them and provide some simulation statistics.

### 3.1 Mead: a message passing framework

The message passing pattern is not new, and it forms the basis of some object oriented models. Mead provides a fast and simple way of passing the traces around as messages stating that something has happened (for example, that the program made a memory access for execution of size x at position y). These messages may contain the identifier of the thread that caused them and also attached data, for example, in the case of a memory write, the data present before the write and the data being written.

Although the API provided by Mead is largely agnostic of the underlying message passing system, we have chosen producer-consumer FIFO queues.

FIFOs are used because they are a well known pattern that is easy to implement and easy to migrate to other interprocess communication systems, such as POSIX message queues or datagram sockets, if interprocess shared memory is not an option.
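To make the shape of such a message concrete, here is a minimal sketch of an event carrying a thread identifier, an access description and the data attached to a write. Every name in it (`MemEvent`, `AccessType`, `kMaxAccessSize`, `make_write_event`) is hypothetical and chosen for illustration; the actual layout used by Mead and Pint may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical access kinds; the real set used by Pint may differ.
enum class AccessType : std::uint8_t { Fetch, Read, Write, Prefetch };

// Assumed upper bound on the size of a single access, in bytes.
constexpr std::size_t kMaxAccessSize = 64;

// One trace message: which thread did what, where, and with which data.
struct MemEvent {
    std::uint32_t thread_id;
    AccessType    type;
    std::uint64_t address;
    std::uint32_t size;                       // bytes accessed
    std::uint8_t  old_data[kMaxAccessSize];   // contents before a write
    std::uint8_t  new_data[kMaxAccessSize];   // contents being written
};

// Fills in the message for a write of `size` bytes at `address`.
// Returns false if the access is larger than the event can carry.
inline bool make_write_event(MemEvent& ev, std::uint32_t tid,
                             std::uint64_t address, const void* before,
                             const void* after, std::uint32_t size) {
    if (size > kMaxAccessSize) return false;
    ev.thread_id = tid;
    ev.type = AccessType::Write;
    ev.address = address;
    ev.size = size;
    std::memcpy(ev.old_data, before, size);
    std::memcpy(ev.new_data, after, size);
    return true;
}
```

A fixed-size, trivially copyable record like this one is convenient for a shared memory queue, since it can be copied into a slot without any pointer that would be meaningless in the other process.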

Our FIFO model differs slightly from the standard model, since it allows for two communication types. On one hand there is the event communication system, which can queue many events for later handling by the receiving side. On the other there is a command interface capable of holding a single command. The command interface requires the sent command to be acknowledged and is used to signal important events that require specific handling by the queue system, like the death of the FIFO or the beginning and end of the simulation.

The main difference between events and commands is that events are unidirectional (from the producer to the consumer), whilst commands can be used bidirectionally (as long as the programmer guarantees the absence of collisions). Commands are also more easily handled with the futex syscall, which makes them very useful for events requiring heavy processing on the other side, since the waiting side can sleep and let the other thread use the CPU while this is done.

The FIFO architecture is built around a central FIFO (called the main FIFO), which is used to send global events that are supposed to stall the simulation until attended (so the simulator can decide whether or not to clear the per thread queues before processing the event). This queue handles at least the thread creation and deletion events, during which a new FIFO queue is negotiated between both sides, but it can also be used to process events like the creation and deletion of mappings, amongst others.

Given the importance of the main FIFO in the architecture, both processes must know beforehand where to access it and must be able to negotiate its creation regardless of which one arrives first (since synchronization is impossible before the FIFO is created).

The framework also features per thread FIFOs, which can be used to send events that are not of global significance to the listener on the other side. This lets the programmer communicate information quickly, since these queues can be lockless and, as a result, as long as at least two processors are available the kernel does not need to switch out the running process, which avoids expensive context switches. The creation of these FIFOs should be negotiated over the main FIFO when implementing the multithreaded version.

### 3.2 Pint: a Pin based trace generator

Pint by itself is not a simulator but a framework providing efficient ways to extract data from the instrumented program through Mead. The version presented with this project is single threaded (although it is designed to be multithreaded, and part of the work for that is already done) and relies on Mead to communicate with the simulator itself. As an example, the provided instrumentation observes the memory accesses made by the program (of any type, ranging from prefetches to execution fetches) and sends them out through Mead, so the simulator can avoid the issues associated with Pin tracing tools.

The code here is heavily focused on speed, and thus the user must be able to choose which features are used.

The granularity of the execution can be easily tuned by setting an appropriate queue size. For example, for instruction by instruction execution the queue must have size one.

Also, a simulation started event must be the first one queued, so that the consumer can discard stale elements and the first instruction is not executed until the simulator is ready.

Since every instruction starts with (and contains) a single fetch event, this event can be used as the delimiter between instructions. Nevertheless, it would be a good idea to include at least the number of events the instruction will cause, to make tracing easier. This may be done in future versions.

Pint also provides a way to specify the number of instructions that must be executed (summed over all threads) before switching to the next simulation mode.

The mode automaton allows for three simulation modes, which are visited in the following order: fast forward, warm up and simulation, after which it goes back to fast forward.

In fast forward mode instructions are just counted and executed, but no data is generated, which allows for near native execution speed. In both warm up and simulation mode instructions generate events (in warm up, to fill the caches), and entering and leaving simulation mode is notified to the consumer so it can handle statistics properly.
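The cycle of modes described above can be sketched as a small automaton driven by an instruction counter. This is only an illustration under stated assumptions: the class and member names are hypothetical, and each mode is assumed to last a fixed, non zero number of instructions.

```cpp
#include <cstdint>

// The three modes, in the order in which they are visited.
enum class Mode { FastForward, WarmUp, Simulation };

// Hypothetical mode automaton: each mode lasts a fixed number of instructions.
class ModeAutomaton {
public:
    ModeAutomaton(std::uint64_t ff, std::uint64_t warm, std::uint64_t sim)
        : budget_{ff, warm, sim} {}

    Mode mode() const { return mode_; }

    // Fast forward only counts instructions; the other two modes emit events.
    bool emits_events() const { return mode_ != Mode::FastForward; }

    // Accounts for one executed instruction. Returns true when the mode
    // changed, which is where the consumer would be notified.
    bool on_instruction() {
        if (++executed_ < budget_[index()]) return false;
        executed_ = 0;
        // Simulation wraps around to FastForward.
        mode_ = static_cast<Mode>((index() + 1) % 3);
        return true;
    }

private:
    int index() const { return static_cast<int>(mode_); }

    std::uint64_t budget_[3];
    std::uint64_t executed_ = 0;
    Mode mode_ = Mode::FastForward;
};
```

Keeping the check to a single counter comparison per instruction matters here, because this test sits on the hot path even when no events are being produced.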

### 3.3 Schnapps: a simple consumer of the traces

Schnapps is intended to be used mainly for analyzing the performance of the instrumentation code by consuming the events generated whilst trying to avoid causing any bottlenecks in execution, and also as an example program of how to extract the generated traces.

Schnapps reads the traces generated by Pint and outputs the map changes as they happen (in a diff like format), together with some statistics for the current simulation and for the total run. In particular, it reports the amount of data read or written by each memory access type, the number of accesses of each type, and an execution mark computed by XORing the addresses of the different accesses together. The idea is that different marks imply that different traces were generated, although equal marks do not necessarily imply equal traces.
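The execution mark can be illustrated with a few lines of code. The function below is a hypothetical stand-in, not the Schnapps source: it simply folds a list of accessed addresses into one value with XOR.

```cpp
#include <cstdint>
#include <vector>

// Folds every accessed address into a single mark by XOR.
inline std::uint64_t execution_mark(const std::vector<std::uint64_t>& addresses) {
    std::uint64_t mark = 0;
    for (std::uint64_t a : addresses) mark ^= a;
    return mark;
}
```

Because XOR is commutative and cancels repeated values, two traces that differ only in the order of their accesses, or in an address that appears an even number of extra times, produce the same mark. This is why equal marks are only a hint that two runs generated the same trace, whereas different marks are conclusive.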

### 3.4 Gin5: a gem5 trace player

Although previous versions of gem5 came with a trace player supporting different formats, these modules stopped being maintained a long time ago and no longer build. As a result, and despite their being a good starting point, the large number of changes the memory system has undergone since then made a different base necessary.

We have therefore written a new generic CPU for playing traces (without a TLB), which requests memory accesses from the queues through a clean interface. It can thus also be used with other trace formats, including the old ones if the classes implementing them are updated.

# Chapter 4

# System design

### 4.1 Mead: a message passing framework

Mead has one macro of particular interest: USE\_YIELD, which enables the use of the yield system call to let other processes use the processor while waiting.

In Mead, we have chosen to implement a lockless ring buffer over shared memory for our FIFOs, since it is a well known pattern [4] [24].

The other reasons for choosing such a structure were speed and independence. By being lockless we avoid expensive spinlocks that would hinder performance, and we also avoid having to either use the locks provided by Pin everywhere or build our own. Also, keeping the data in shared memory spares us expensive system calls to pass the messages and allows the cache to be used for that.

Note that the lockless queue only works properly if a single thread acts as reader and a single thread acts as writer. If there are more threads on either side, they require a lock to work properly. Our implementation is based on templates, so it can be used with different classes, although it should be kept in mind that the same class (i.e. no inheritance) must be used throughout the queue.
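The effect of such a switch can be sketched as follows. The helper `wait_until_set` is hypothetical, the macro is defined inside the snippet only to keep it self-contained, and `sched_yield` is assumed to be the yield call in question.

```cpp
#include <atomic>
#include <sched.h>   // sched_yield (POSIX)

#define USE_YIELD    // assumed to come from the build flags in real code

// Waits until `flag` becomes true. With USE_YIELD defined the waiter gives
// up the processor after each failed check; otherwise it busy-waits.
inline void wait_until_set(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire)) {
#ifdef USE_YIELD
        sched_yield();
#endif
    }
}
```

Yielding helps when producer and consumer share a core, whereas a pure busy-wait gives lower latency when each side has a processor of its own.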
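The single reader, single writer restriction is easier to see in code. The following is a minimal sketch of a lockless ring buffer using C++11 atomics; it is not Mead's implementation, and the class name and method signatures are assumptions made for illustration. Placing such a structure in shared memory additionally assumes that the atomic indices are lock-free on the target platform.

```cpp
#include <atomic>
#include <cstddef>

// Single-producer, single-consumer ring buffer. One slot is always left
// unused so that "full" and "empty" can be told apart without a counter.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "need at least two slots");
public:
    bool empty() const {
        return read_.load(std::memory_order_acquire) ==
               write_.load(std::memory_order_acquire);
    }
    bool full() const {
        return next(write_.load(std::memory_order_acquire)) ==
               read_.load(std::memory_order_acquire);
    }
    // Must be called by the producer thread only.
    bool push(const T& value) {
        const std::size_t w = write_.load(std::memory_order_relaxed);
        if (next(w) == read_.load(std::memory_order_acquire)) return false;
        slots_[w] = value;
        write_.store(next(w), std::memory_order_release);  // publish the slot
        return true;
    }
    // Must be called by the consumer thread only.
    bool pop(T& out) {
        const std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire)) return false;
        out = slots_[r];
        read_.store(next(r), std::memory_order_release);   // free the slot
        return true;
    }
private:
    static std::size_t next(std::size_t i) { return (i + 1) % N; }

    T slots_[N];
    std::atomic<std::size_t> write_{0};   // modified by the producer only
    std::atomic<std::size_t> read_{0};    // modified by the consumer only
};
```

Each index has exactly one writer, which is what removes the need for a lock. With two producers, both could read the same `write_` value and overwrite the same slot, which is the failure the text warns about.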

Also, one of the current major caveats is that the shared memory address is hardcoded and, as a result, only a single instance of the program can run at a time. We expect to fix this in future versions by providing a launcher that will allocate an anonymous shared memory segment and pass its identifier to both Pint and the trace reader being used.

The command types are defined by the shmstatus enum. Since newly allocated shared memory is filled with 0s, we take the value 0 as the initial state (NONE). The server then writes a SERVER\_STARTED command and waits for a CLIENT\_ACK. Finally, when dying, the server is expected to send the SERVER\_DIED command so that the client does not wait forever for data.

Of the many methods provided, those of special relevance to the programmer are: the gethead and gettail methods, which give access to the data we want to insert into or extract from the queue; the push and pop methods, which add or remove an element; and the full and empty methods, which check for these states.

Some methods for waiting while the queue is empty or full are also provided, but these must be used carefully, since a single threaded consumer could hang forever waiting for the queue to meet the condition. In this case, special waits that also monitor the main queue status are recommended instead. A wait\_push method is also provided, which waits until a push can be done.

For control handling we provide the send\_control, receive\_control and ack\_control methods. In particular, send\_control waits until an ACK is sent back to state that the condition was taken care of.

Finally, the wait\_start and tell\_Start methods are provided for initialization. Instead of yielding, they use the futex syscall to block the thread until they have been attended, which reduces the processor load in some situations.
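The handshake just described might look roughly like the sketch below. Only the enumerator names come from the text; their numeric values (other than NONE being 0), the `CommandCell` type and the two helper functions are assumptions, and they are deliberately not named after Mead's own control methods.

```cpp
#include <atomic>

// NONE must be 0 to match zero-filled shared memory. The values of the
// remaining members are assumed.
enum shmstatus : int { NONE = 0, SERVER_STARTED, CLIENT_ACK, SERVER_DIED };

// Hypothetical single-slot command cell living in shared memory.
struct CommandCell {
    std::atomic<int> status{NONE};
};

// Server side: publish a command. Fails while an earlier command is still
// waiting to be acknowledged.
inline bool send_command(CommandCell& c, shmstatus cmd) {
    int expected = c.status.load(std::memory_order_acquire);
    if (expected != NONE && expected != CLIENT_ACK) return false;
    return c.status.compare_exchange_strong(expected, cmd,
                                            std::memory_order_acq_rel);
}

// Client side: acknowledge SERVER_STARTED. SERVER_DIED is left in place so
// that every later check still observes it.
inline bool ack_command(CommandCell& c) {
    int expected = SERVER_STARTED;
    return c.status.compare_exchange_strong(expected, CLIENT_ACK,
                                            std::memory_order_acq_rel);
}
```

Since the slot holds a single value, a second command cannot be posted until the first is acknowledged, which mirrors the acknowledgement requirement of the command interface.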

### 4.2 Pint: a Pin based trace generator

In order to ensure unwanted features will not hinder performance preprocessor based switches can be used to disable those you are not interested in using, also some other options can be set in this way.

The macros of interest here are PADSIZE which defines the amount of bytes of the cache line in order to prevent false sharing, MAXMEMSIZE which defines the maximum size a single memory access may have (used for amongst other setting the size of the buffers), USE\_DATA which will enable the infrastructure for fetching and sending the accessed data in the events, MULTITHREADED which will enable the still incomplete multithreaded code, USE\_STATES which will enable the fast forward, warm up and simulation state machine and DTRACE which will make pint output some debugging information.
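As an illustration of what a cache line size macro is typically used for, the sketch below keeps the producer and consumer indices of a queue on separate cache lines. The value 64 is an assumed default (common on x86-64), and the `QueueIndices` structure is hypothetical rather than taken from Pint.

```cpp
#include <atomic>
#include <cstddef>

#ifndef PADSIZE
#define PADSIZE 64   // assumed cache line size in bytes
#endif

// Each index gets its own cache line, so a write to one of them does not
// invalidate the line holding the other.
struct QueueIndices {
    alignas(PADSIZE) std::atomic<std::size_t> write_index{0};
    alignas(PADSIZE) std::atomic<std::size_t> read_index{0};
};

static_assert(alignof(QueueIndices) == PADSIZE, "aligned to a cache line");
static_assert(sizeof(QueueIndices) == 2 * PADSIZE, "one line per index");
```

Without the padding both indices would likely share a line, and every push on one core would force the core running the consumer to reload it even though the consumer's own index did not change.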

Given the impact these features can have, we decided to let the user disable them at compile time. Some of them can also be disabled at run time, although they will still have some impact on the execution. In particular, USE\_STATES still incurs the slowdown of the conditionals introduced before the instrumentation calls to handle the state machine, and USE\_DATA makes the events, and thus the queues, larger.

A final option that can be disabled is mapping tracing after a context change (disabled by default) and after a syscall. The reason is the great slowdown this operation causes, since it requires at least 3 system calls and the parsing of a large text file.

In Pint the instrumentation is added by the Instruction function. When given the choice between adding complexity here or in the instrumentation functions, we should add it here, since this function is executed much less frequently than the instrumentation code. As can be seen, this function just tells Pin to add calls to the proper instrumentation functions, either preceded by conditionals if the state machine is being used or without them otherwise.

Here we should consider all the parsemaps functions, which are wrappers around the original parsemaps function that takes care of generating the events when mappings are added or removed. As can be seen, this function consumes quite a lot of resources given the way it works. Sadly, the Pin framework on which Pint is based does not provide any API to distinguish the mappings made by the instrumentation (including the JIT caches and the instrumentation code itself); as a result, a lot of events are generated on the simulation status queue. To reduce this overhead we assume that mappings may only change after coming back from either a context change (as is the case when the application is being ptraced by a debugger) or a system call. This reduces the overhead greatly but still generates a lot of spurious mapping changes that may pollute the simulator assumptions. We expect this issue to be fixed with the addition of a proper API in future versions of Pin. Unlike memory access information, mapping information is important enough that it is sent as it is generated, independently of the simulation mode.
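The idea of comparing the current mappings with the previous ones can be sketched as below. This is an illustration under assumptions, not the project's parsemaps: Range, parse\_maps, MapDiff and diff\_maps are hypothetical names, and the input is passed as a string in the /proc/&lt;pid&gt;/maps line format instead of being read from the file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdio>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical address range [b, e), ordered by begin and then by end.
struct Range {
    unsigned long b, e;
    bool operator<(const Range &r) const {
        return b != r.b ? b < r.b : e < r.e;
    }
};

// Parses lines of the form "begin-end perms ..." into a set of ranges.
inline std::set<Range> parse_maps(const std::string &text) {
    std::set<Range> out;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        Range r;
        if (std::sscanf(line.c_str(), "%lx-%lx", &r.b, &r.e) == 2)
            out.insert(r);
    }
    return out;
}

struct MapDiff {
    std::vector<Range> removed; // present before, missing now
    std::vector<Range> added;   // missing before, present now
};

// Computes which ranges disappeared and which ones appeared.
inline MapDiff diff_maps(const std::set<Range> &prev,
                         const std::set<Range> &cur) {
    MapDiff d;
    std::set_difference(prev.begin(), prev.end(), cur.begin(), cur.end(),
                        std::back_inserter(d.removed));
    std::set_difference(cur.begin(), cur.end(), prev.begin(), prev.end(),
                        std::back_inserter(d.added));
    return d;
}
```

Each entry in removed and added would correspond to one mapping event on the status queue, which is why spurious mappings created by the instrumentation itself translate directly into extra events.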

To keep track of the memory accesses the RecordMemExec, RecordMemRead, RecordMemPrefetch and RecordMemPreWrite functions are used. When the user is interested in the data involved in these accesses, RecordMemPreWrite changes its behavior so that it captures the memory contents before the access, and a new function called RecordMemWrite, executed after the instruction finishes, is added. The reason is that the written data cannot be known otherwise.

To handle the state machine we have an enum called state, which contains the current simulation state, and a function called nextState, which updates that variable together with the instruction counter and is called only when the instruction counter reaches zero. The StateCounter method decreases the instruction counter by one and reports whether we have processed the last instruction or not. CounterDone sends the events for starting or ending a simulation. Finally, the Instrument function checks whether instrumentation code should be run in the current state.
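A minimal sketch of such a counter-driven state machine follows. All names (Phase, PhaseMachine, next\_phase, count\_instruction, should\_instrument) are hypothetical, the counters sim\_starts and sim\_ends stand in for the start and end events, and the rule that instrumentation runs in every phase except fast forward is an assumption of the example. Each phase budget is assumed to be non-zero.

```cpp
#include <cassert>
#include <cstdint>

enum class Phase { FastForward = 0, WarmUp = 1, Simulation = 2 };

// Hypothetical phase machine driven by an instruction counter.
struct PhaseMachine {
    std::uint64_t budget[3];  // instructions per phase, indexed by Phase
    Phase phase = Phase::FastForward;
    std::uint64_t remaining;  // instructions left in the current phase
    unsigned sim_starts = 0;  // stand-in for the "simulation start" event
    unsigned sim_ends = 0;    // stand-in for the "simulation end" event

    PhaseMachine(std::uint64_t ff, std::uint64_t wu, std::uint64_t sim)
        : budget{ff, wu, sim}, remaining(ff) {}

    // Moves to the following phase and reloads the counter.
    void next_phase() {
        if (phase == Phase::Simulation) ++sim_ends;
        phase = static_cast<Phase>((static_cast<int>(phase) + 1) % 3);
        if (phase == Phase::Simulation) ++sim_starts;
        remaining = budget[static_cast<int>(phase)];
    }

    // Called once per executed instruction.
    void count_instruction() {
        if (--remaining == 0) next_phase();
    }

    // Assumed rule: events are recorded in every phase except fast forward.
    bool should_instrument() const { return phase != Phase::FastForward; }
};
```

Keeping the per-instruction work down to a decrement and a comparison matters because this check runs before every instrumentation call, which is the residual cost the text attributes to USE\_STATES.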

Finally, we have a few callbacks: ThreadStart, used to notify the creation of new threads; ThreadFini, used to notify their destruction; and Fini, which is called before the instrumented program exits and generates the SERVER\_DIED event.

### 4.3 Schnapps: a simple consumer of the traces

Given its simplicity, all the code in Schnapps is written in the main function.

First, the queues are negotiated with Pint. Afterwards, the variables holding the stats are initialized to 0 and we record that we are not simulating anything.

With that done we enter the main loop, which processes information until the trace generator reports that it has died. In this loop, the data from the thread queue is extracted and added to the statistics. Afterwards, the main queue is checked for events such as mappings being added or removed, and these changes are printed. Finally, control signals are handled, including the beginning of a simulation (by setting the stats to 0) and its end (by printing the simulation stats).
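The shape of this loop can be sketched as follows. The example is a simplification under assumptions: the two queues and the control word are flattened into a single in-memory event stream, and the names Ev, Counters, Report, count and consume are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flattened event stream: memory accesses plus control signals.
enum class Ev { Exec, Read, Write, SimStart, SimEnd, ServerDied };

struct Counters {
    std::size_t exec = 0, reads = 0, writes = 0;
};

struct Report {
    Counters total;                // everything seen until the producer died
    std::vector<Counters> windows; // one entry per finished simulation window
};

// Adds one memory access to a set of counters.
inline void count(Counters &c, Ev e) {
    if (e == Ev::Exec) ++c.exec;
    else if (e == Ev::Read) ++c.reads;
    else if (e == Ev::Write) ++c.writes;
}

// Processes events until the producer reports that it has died.
inline Report consume(const std::vector<Ev> &stream) {
    Report r;
    Counters sim;
    for (Ev e : stream) {
        if (e == Ev::ServerDied) break;                              // leave the loop
        if (e == Ev::SimStart) { sim = Counters{}; continue; }       // reset stats
        if (e == Ev::SimEnd) { r.windows.push_back(sim); continue; } // report stats
        count(r.total, e);
        count(sim, e);
    }
    return r;
}
```

Here the per-window counters are stored instead of printed, which keeps the sketch testable; a consumer like the one described would print them at that point.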

Finally, once outside of the loop and with the simulation finished, we print the total stats.

### 4.4 Gin5: a gem5 trace player

Most of the coding effort probably went into these classes, since we had to revamp the trace readers and the trace CPUs so that they would work with the memory system currently used by gem5.

The MemTraceReader class is a very simple class providing a single method called getNextRequest, which returns either a pointer to the next request to be played on the memory system or a NULL pointer along with the reason why no request is available.

The memory requests are represented by the MemTraceRequest class, which returns packets through the getNextPkt method.

The PinReader class is derived from the MemTraceReader class and, aside from handling the Pint queues, also adds a callback to delete the queues when done.

Finally, the TraceCPU class provides the MemPort and TickEvent classes, which are required by the simulator, and is responsible for requesting data from the reader when necessary and sending the requests to the memory system through the proper port. From a CPU point of view it emulates a system without a TLB (we basically take the LSBs of the virtual addresses we get to convert them into physical addresses) with ports for an instruction cache and a data cache.
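The address conversion mentioned above amounts to a bit mask. The sketch below shows the idea; the function name to\_physical and the bits parameter (the assumed width of the simulated physical address space, which must be below 64) are hypothetical and not taken from the Gin5 sources.

```cpp
#include <cassert>
#include <cstdint>

// Keeps the `bits` least significant bits of a virtual address.
// Precondition (assumed): 0 < bits < 64.
constexpr std::uint64_t to_physical(std::uint64_t vaddr, unsigned bits) {
    return vaddr & ((std::uint64_t{1} << bits) - 1);
}
```

A consequence of this scheme is that two virtual addresses differing only in their upper bits map to the same physical address, which is one reason proper TLB handling is desirable.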

An example gem5 configuration using this class is also provided in the pintrace.py file.

# Chapter 5

# Results

The benchmark results can be seen in the following table (extracted from the annexed .ods file).

We were surprised to get a slowdown as high as 804x in the case of LU (lu\_non\_cont), and also by the fact that fmm only showed a 94x slowdown in the Graphite benchmarks. If we discard the fmm benchmark, our system, using a single processor, performs better than Graphite in every benchmark for which Graphite figures are available except cholesky (361x versus 346x).

| Application    | Graphite slowdown | Pint slowdown |
|----------------|-------------------|---------------|
| barnes         | N/A               | 287           |
| cholesky       | 346               | 361           |
| fft            | 3978              | 284           |
| fmm            | 94                | 322           |
| lu_cont        | 4007              | 557           |
| lu_non_cont    | 3061              | 804           |
| ocean_cont     | 515               | 317           |
| ocean_non_cont | 433               | 360           |
| radiosity      | N/A               | 498           |
| radix          | 1648              | 199           |
| raytrace       | N/A               | 279           |
| volrend        | N/A               | 404           |
| water_nsquared | 2465              | 509           |
| water_spatial  | 966               | 683           |

Table 5.1: Slowdown comparison between Graphite with 8 cores and Pint with one

# Chapter 6

# Conclusions

The development of the project has taken a long time given its research components; nevertheless, it has given us good insight into how to improve the speed of simulators.

Also, given the promising results obtained with the benchmarks run during the development and testing of this fairly limited version (a worst-case slowdown of 804x when simulating, a best case of 199x, a median of 360.5x and a mean of 419x), we think that ideas like simulation segmentation and instrumentation-based simulation in an independent process may help in the development of faster and more powerful simulators, and we will continue with its development.
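The summary figures quoted above can be reproduced from the Pint column of Table 5.1: the middle value (median) is 360.5 and the arithmetic mean is about 419. The helper names below are chosen for the example only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Pint slowdowns from Table 5.1, in table order.
inline std::vector<double> pint_slowdowns() {
    return {287, 361, 284, 322, 557, 804, 317, 360, 498, 199, 279, 404, 509, 683};
}

// Arithmetic mean of the values.
inline double mean(const std::vector<double> &v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Median: middle value, or average of the two middle values for even sizes.
inline double median(std::vector<double> v) {
    std::sort(v.begin(), v.end());
    std::size_t n = v.size();
    return n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
}
```

With 14 benchmarks the median is the average of the 7th and 8th sorted values (360 and 361), and the mean is 5864 / 14.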

We expect to see in the future heavily multithreaded simulators where each processor has its own group of threads, each handling a different stage independently, in order to speed up execution times on multiprocessor machines.

We also expect more simulators in the future to use process-based separation between the data collection routines responsible for the execution of the program and the simulation itself, allowing the use of higher level languages with fewer restrictions whilst still providing high performance and native execution of the simulated programs.

### 6.1 Improvements for next release

In the next release we intend to have a fully parallel instrumentation framework. We will also reimplement the trace simulator as a full gem5 CPU so that it can have proper TLB handling and can be extended internally with more complex models. Finally, we will change the queuing system so that the simulator knows how many events will be generated by the instruction being executed before these events are handed down. We will also interconnect Multi2Sim with gem5 in order to prove the power of Mead.

Once we release the next version we intend to publish a paper on this topic.

# Appendix A

# User manual

### A.1 Building

To build the sources it suffices to run the make command in the sources directory.

### A.2 Pint

Running the Pint pintool is quite easy; it is enough to run:

./pin -t source/tools/SimpleExamples/obj-intel64/pinatrace.so -- command arguments

Options can be set by placing the desired switches between pinatrace.so and the --

Currently the following options are available:

- -f number: adds the given number of instructions to be run in the fast forward state (if used several times, each further value sets the instruction count for the next time we go back to said state)
- -w number: adds the given number of instructions to be run in the warm up state (if used several times, each further value sets the instruction count for the next time we go back to said state)
- -s number: adds the given number of instructions to be run in the simulation state (if used several times, each further value sets the instruction count for the next time we go back to said state)
- -syscallmap {0,1}: disables, if 0, or enables, if 1, the checking of process mappings after returning from a syscall
- -ctxchangemap {0,1}: disables, if 0, or enables, if 1, the checking of process mappings after a context change
- -values {0,1}: disables, if 0, or enables, if 1, the copying of data along with the memory events

### A.3 Schnapps

To run Schnapps just run ./consumer

### A.4 Gin5

Gin5 requires a Python file describing the system to be emulated. An example of such a system can be found in the pintrace.py file. Once you have set up your system in a Python file you just need to run the gem5.fast binary followed by the file containing the system definition.

Scripts that set up systems may take arguments from the command line if these are given after the script file. Our example file does not make use of this feature but others may.

# Appendix B

# Relevant source code

#### pinatrace.h

```
#ifndef PINATRACE_H
#define PINATRACE_H
#include <linux/futex.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/shm.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <new>

#define likely(x)   __builtin_expect (!!(x), 1)
#define unlikely(x) __builtin_expect (!!(x), 0)

// configuration
// ...

#ifdef USE_YIELD
#ifdef PIN_H
#define YIELD PIN_Yield
#else
#include <sched.h>
#define YIELD sched_yield
#endif
#else
#define YIELD()
#endif

#define cachealigned __attribute__ ((aligned (PADSIZE)))

#ifndef PIN_H
typedef void VOID;
typedef u_int32_t UINT32;
typedef u_int8_t UINT8;
#endif

enum DataType { INVALDATA, STARTTH, ACCMEM };
```

```
enum AccessType {ACCEXEC, ACCREAD, ACCWRITE, ACCPREFETCH };

//#define PAD(n) ((((n) + (PADSIZE - 1)) / PADSIZE) * PADSIZE)

#include <cassert>

#ifdef DTRACE
#define dcprintf(c,...) if(c) fprintf (stderr, __VA_ARGS__)
#define dprintf(...) fprintf (stderr, __VA_ARGS__)
#define dcputs(c,a) if(c) fputs((a),stderr)
#define dputs(a) fputs((a),stderr)
#else
#define dcprintf(...)
#define dprintf(...)
#define dcputs(c,a)
#define dputs(a)
#endif

//TODO: use alignments instead of paddings
//TODO: use other padded struct for the data from read to write
class MemAccess {
private:
    //We are not going to use derivate classes here for efficiency
    AccessType type;
    VOID * ea; // Effective address of the access
#ifdef USE_DATA
    char data[MAXMEMSIZE]; // Contains either the data executed/read or the data contained before writing
    char wdata[MAXMEMSIZE]; // This is valid only when the data access is a write contains the written data
#endif
    UINT32 size; // Size of the access
#ifdef MULTITHREADED
    UINT32 tid; // The ID of the thread generating the access
#endif
#ifdef USE_DATA
    // Note: the copy is made inside assert(), so it is compiled out if NDEBUG is defined
    inline void setData() {
        assert(PIN_SafeCopy(data, ea, size) == size);
    }
    inline void copyData(const MemAccess &ma) {
        memcpy(data, ma.data, size);
        if (type == ACCWRITE)
            memcpy(wdata, ma.wdata, size);
    }
#endif
public:
#ifdef USE_DATA
    inline void setWdata() {
        assert(PIN_SafeCopy(wdata, ea, size) == size);
    }
#endif
    // ... (signature of the setter taking type, ea, size and, if MULTITHREADED, tid)
        this->type = type;
        this->ea = ea;
        this->size = size;
#ifdef MULTITHREADED
        this->tid = tid;
#endif
#ifdef USE_DATA
        if (type != ACCPREFETCH) {
            setData();
        }
#endif
    }
    inline void MemAccessSet (const MemAccess &ma) {
        type = ma.type;
        ea = ma.ea;
        size = ma.size;
#ifdef MULTITHREADED
        tid = ma.tid;
#endif
#ifdef USE_DATA
        if (type != ACCPREFETCH) {
            copyData(ma);
        }
#endif
    }
    void show(FILE *f); //Requires the C LOCK
    inline AccessType getType() const { return type;}
    inline void* getEA() const { return ea;}
```

```
#ifdef USE_DATA
    inline const void* getData() const { return data;}
    inline const void* getWData() const { return wdata;}
#endif
    inline UINT32 getSize() const { return size;}
#ifdef MULTITHREADED
    inline UINT32 getTid() const { return tid;}
#endif
} cachealigned;

union SimDataU {
    class MemAccess ma;
};

class SimData {
private:
    DataType type;
    SimDataU data;
public:
    SimData() : type(INVALDATA) {
    }
    SimData(DataType _type) : type(_type) {
    }
    inline DataType getType () const {
        return type;
    }
    inline void setType (DataType _type) {
        type = _type;
    }
    inline MemAccess & getMa () {
        type = ACCMEM;
        return data.ma;
    }
    inline const MemAccess & getCMa () const {
        assert(type == ACCMEM);
        return data.ma;
    }
};

enum InstEventType {
    ADDMAPPING, //A mapping was added during the last context change/syscall
    REMOVEMAPPING, //A mapping was removed during the last context change/syscall
    ADDTHREAD, //A new execution thread has been spawned
    REMOVETHREAD, //An execution thread has ceased existing
};

struct range {
    unsigned long int b; //begin
    unsigned long int e; //end
    inline bool operator < (const struct range &r) const {
        //There shouldn't be overlapping ranges (at least in theory);
        // ...
    }
};

union InstEventData {
    struct range r;
    // ...
};

class InstEvent {
private:
    InstEventType type;
    InstEventData data;
public:
    inline InstEvent() { }
    inline void SetInstEvent (InstEventType _type, range _r) {
        type = _type;
        data.r = _r;
    }
    inline InstEventType getType () const {
        return type;
    }
    inline range getRange () {
        assert(type == ADDMAPPING || type == REMOVEMAPPING);
        return data.r;
    }
} cachealigned;

// This is a class implementing lockless single producers single consumer queues
// They are very useful for fast efficient IPC through shared memory though you
// need to ensure the structure being queued has all the necessary data inside
// i.e. doesn't uses references.
```

```
// Currently we use them for two purposes, passing events related to memory
// mappings and threads between the instrumentation and the simulator and
// passing around the memory accesses of each thread.

//We only use QSIZE -1 thus there is always one element free for processing before queueing.
#define NEXTQELEM(v) (((v) + 1) % QSIZE)

enum shmstatus {NONE = 0,//Initial state
    CLIENT_ACK=1, //The client confirms reception of previous state
    SERVER_STARTED=2, //The server has just started
    SERVER_DIED=3, //The server has died
    //This ones refer to the next instruction pushed to the queue (so they include up until the ACCEXEC after that)
    SERVER_SIM_START=4, //We are going to jump into simulation reset stats
    SERVER_SIM_END=5 //We have ended simulation reset stats
};

// A lockless single producer single consumer queue, with more than 1 you will need locks
template <class T, int QSIZE=QSIZE> class SHMQ {
    T queue[QSIZE] cachealigned;
    volatile sig_atomic_t qhead cachealigned;
    volatile sig_atomic_t qtail cachealigned;
    // Elements are inserted on the head and removed from the tail like a snake.
    volatile sig_atomic_t control cachealigned;
public:
    inline SHMQ () : qhead(0), qtail(0) {
    }
    inline T & gethead() { return queue[qhead]; }
    inline T & gettail() {
        assert(!empty());
        return queue [qtail];
    }
    inline bool full() {return NEXTQELEM(qhead) == qtail; }
    inline bool empty() {return qtail == qhead; }
    //Wait for the queue not to be full
    inline void wait_full() {
        while(unlikely(full())) YIELD();
    }
    //Wait for the queue not to be empty (If the server dies it will never be)
    inline bool wait_empty_cond() {
        return (empty() && control == CLIENT_ACK);
    }
    inline void wait_empty() {
        while(unlikely(wait_empty_cond())) YIELD();
    }
    inline void wait_not_empty() {
        while (unlikely(!empty())) YIELD();
    }
    inline void push() {
        assert(!full());
        qhead = NEXTQELEM(qhead);
    }
    //Wait if necessary then push
    inline void wait_push() {
        wait_full();
        push ();
    }
    inline void pop() {
        assert (!empty());
        qtail = NEXTQELEM(qtail);
    }
    inline enum shmstatus receive_control () {
        if (control == CLIENT_ACK) return NONE;
        return (enum shmstatus) control;
    }
    inline void ack_control () {
        while (unlikely(control == CLIENT_ACK)) YIELD();
        control = CLIENT_ACK;
    }
    inline void send_control(enum shmstatus st) {
        assert(st != CLIENT_ACK); //For this we should use ack_control instead
        control = st;
        //Wait for the ACK
        while (unlikely(control != CLIENT_ACK)) YIELD();
    }
    inline void wait_start() {
        sig_atomic_t control_;
        //Sleep on the futex until the server announces it has started
        while ((control_ = control) != SERVER_STARTED) syscall(SYS_futex, &control, FUTEX_WAIT, control_, NULL, NULL, 0);
        control = CLIENT_ACK;
        syscall(SYS_futex, &control, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
    inline void tell_start() {
        sig_atomic_t control_;
```

```
        control = SERVER_STARTED;
        syscall(SYS_futex, &control,FUTEX_WAKE,1,NULL,NULL,0);
        //Wait for the ACK
        while ((control_ = control) != CLIENT_ACK) syscall(SYS_futex, &control,FUTEX_WAIT,SERVER_STARTED,NULL,NULL,0);
    }
};

typedef SHMQ<SimData> SimDataq;
typedef SHMQ<InstEvent> InstEventq;

SimDataq * server_init2();
SimDataq * client_init2();
void server_fini2(SimDataq *q);
void client_fini2(SimDataq *q);

//TODO: Fix the case where the client is the one doing the finalization
//TODO: access queues should be created dynamically and passed through the event queue

SimDataq * get_q2(int &shmid) {
    SimDataq * q;
    if ((shmid = shmget(2684, sizeof(SimDataq), IPC_CREAT | 0666)) < 0) {
        perror("shmget");
        exit(1);
    }
    void *shm;
    if ((shm = shmat(shmid, NULL, 0)) == (void *) -1) {
        perror("shmat");
        exit(1);
    }
    q = static_cast < SimDataq *> (shm);
    return q;
}

SimDataq * server_init2() {
    int shmid;
    SimDataq * q = get_q2(shmid);
    new (q) SimDataq(); //We use a placement new so we have the SimDataq in the shared memory
    q->tell_start();
    return q;
}

SimDataq * client_init2() {
    int shmid;
    SimDataq * q = get_q2(shmid);
    q->wait_start();
    //Since we are connected we tell the OS the segment can be deleted
    if (shmctl(shmid,IPC_RMID,NULL) < 0)
        perror("shmctl");
    return q;
}

void server_fini2(SimDataq *q) {
    q->send_control(SERVER_DIED);
}

void client_fini2(SimDataq *q) {
    q->ack_control();
    q->~SimDataq();
}

//TODO: with propper template usage this could get prettier
InstEventq * server_init();
InstEventq * client_init();
void server_fini(InstEventq *q);
void client_fini(InstEventq *q);

InstEventq * get_q(int &shmid) {
    InstEventq * q;
    if ((shmid = shmget(2687, sizeof(InstEventq), IPC_CREAT | 0666)) < 0) {
        perror("shmget");
        exit (1);
    }
    void *shm;
    if ((shm = shmat(shmid, NULL, 0)) == (void *) -1) {
        perror("shmat");
        exit(1);
    }
    q = static_cast < InstEventq*>(shm);
    return q;
}
```


```
InstEventq * server_init() {
    int shmid;
    InstEventq * q = get_q(shmid);
    new (q) InstEventq(); //We use a placement new so we have the InstEventq in the shared memory
    q->tell_start();
    return q;
}

InstEventq * client_init() {
    int shmid;
    InstEventq * q = get_q(shmid);
    q->wait_start();
    //Since we are connected we tell the OS the segment can be deleted
    if (shmctl(shmid,IPC_RMID,NULL) < 0)
        perror("shmctl");
    return q;
}

void server_fini(InstEventq *q) {
    q->send_control(SERVER_DIED);
}

void client_fini(InstEventq *q) {
    q->ack_control();
    q->~InstEventq();
}

#endif
```

#### pinatrace.cpp

```
/*BEGIN_LEGAL
 * Intel Open Source License
 *
 * Copyright (c) 2002-2011 Intel Corporation. All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met:
 *
 * Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer. Redistributions
 * in binary form must reproduce the above copyright notice, this list of
 * conditions and the following disclaimer in the documentation and/or
 * other materials provided with the distribution. Neither the name of
 * the Intel Corporation nor the names of its contributors may be used to
 * endorse or promote products derived from this software without
 * specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE INTEL OR
 * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 * END_LEGAL */

/*
 *  @ORIGINAL_AUTHOR: Robert Cohn
 */

/*! @file
 *  This file contains an ISA-portable PIN tool for tracing memory accesses
 */

#include "pin.H"
#include "pinatrace.H"

#include <iostream>
#include <set>
#include <algorithm>

// ... when calling C and C++ library functions
PIN_LOCK c_lock;

FILE *StatsFile;

// KNOB<string> KnobOutputFile(KNOB_MODE_WRITEONCE, "pintool", "o", "pinatrace.out", "specify trace file name");
KNOB_COMMENT fcomment( "pintool:trace", "Options for the tracing behaviour");
#ifdef USE_STATES
// KNOB<UINT64> Knobf(KNOB_MODE_APPEND, "pintool:trace", ... "Number of instructions to fast forward. Must be used as many times as -w and ...
// KNOB ... (KNOB_MODE_APPEND, "pintool:trace", "w", "", ...
// KNOB ... "Number of instructions to use for simulation. Must be used as many times as -f a...
// ...
// KNOB ... "ctxchangemap", "0", "Check mappings after a context change");
// ...
#endif
#ifdef MULTITHREADED
// #define GetQLock(tid) ...
#define ReleaseQLock() ReleaseLock(&(h_lock))
#else
#define GetQLock(tid)
#define ReleaseQLock()
```

```
#endif
      static SimDataq *q;
static InstEventq *iq;
 86
      static FILE *mout;
 88
      \mathbf{void} \ \mathsf{parsemaps}(\mathbf{void}) \ \{
 90
           FILE *f;

static set<range> prev;

set<range> s, rem, add;
 92
           94
 95
 96
 98
100
                int c;
int c;
while ((c = fgetc(f)) != '\n') putchar(c);
putchar('\n');
while (fgetc(f) != '\n');
102
103
104
           }; //We don't need the file any more so release the FD \,
105
106
107
            fclose(f);
108
           109
110
\begin{array}{c} 111 \\ 112 \end{array}
                 \begin{array}{l} \text{iq->gethead().SetInstEvent(REMOVEMAPPING,* it);} \\ \text{iq->wait\_push();} \\ printf("-\ \%lx-\%lx\ \backslash n\ ",it->b\ ,it->e); \end{array} 
113
114
115
116
117
118
            //Now notify additions
           119
121
123
124
125
           //Ok finally we'll make the old set this one prev = s;
127
128
129
      }
void parsemaps1(THREADID _1, CONTEXT *_2, SYSCALL_STANDARD _3, VOID *_4) {
    fputs("Syscall!\n", mout);
    parsemaps();
}

void parsemaps2(THREADID _1, CONTEXT_CHANGE_REASON _2, const CONTEXT *_3, CONTEXT *_4, INT32 _5, VOID *_6) {
    fputs("Context change!\n", mout);
    parsemaps();
}

void parsemaps3(void) {
    fputs("Initial!\n", mout);
    parsemaps();
}

static char AccessType2Char(AccessType at) {
    switch (at) {
        case ACCEXEC: return 'X';
        case ACCREAD: return 'R';
        case ACCWRITE: return 'W';
        case ACCPREFETCH: return 'P';
        default: return '?';
    }
}

...
    switch (size)
    {
        case 0:
            ...
        /* TODO: Here we do some assumptions about sizes, a proper program should fill them properly */
        case 1:
            fprintf(f, "0x%02hhx", (*static_cast<UINT8*>(data)));
            break;
        case 2:
            fprintf(f, "0x%04hx", (*static_cast<UINT16*>(data)));
            break;
        case 4:
            fprintf(f, "0x%08x", (*static_cast<UINT32*>(data)));
            break;
        case 8:
            fprintf(f, "0x%016lx", (*static_cast<UINT64*>(data)));
            break;
        ...
                fprintf(f, "%02hhx", (static_cast<UINT8*>(data)[i]));
        ...
            break;
    }
}
#endif

void MemAccess::show(FILE *f) {
    fprintf(f,
#ifdef MULTITHREADED
            "%10u"
#endif
            " %c %#016lx% 3d ",
#ifdef MULTITHREADED
            (UINT32) tid,
#endif
            AccessType2Char(type), (unsigned long int) ea, size);
#ifdef USE_DATA
    if (KnobValues) {
        if (type != ACCPREFETCH) {
            EmitMem(f, data, size);
            if (type == ACCWRITE) {
                fputs(" -> ", f);
                EmitMem(f, wdata, size);
            }
        }
    }
#endif
    fputs("\n", f);
}

static INT32 Usage()
{
    fputs("This tool produces a memory address trace.\n"
          "For each memory access (execute/read/write/prefetch) the ea is recorded\n"
          "\n", stderr);
    fputs(KNOB_BASE::StringKnobSummary().c_str(), stderr);
    fputs("\n", stderr);
    return -1;
}

...
#endif
...
{
    //TODO: use a per thread lockless queue
    GetQLock(tid+1);
    q->gethead().getMa().MemAccessSet(ACCEXEC, ip, size
#ifdef MULTITHREADED
        , tid
#endif
        );
    iq->wait_not_empty();
    q->wait_push();
    ReleaseQLock();
}

static VOID PIN_FAST_ANALYSIS_CALL RecordMemRead(VOID * ea, UINT32 size
#ifdef MULTITHREADED
    , THREADID tid
#endif
)
{
    //TODO: use a per thread lockless queue
    GetQLock(tid+1);
    q->gethead().getMa().MemAccessSet(ACCREAD, ea, size
#ifdef MULTITHREADED
        , tid
#endif
        );
    iq->wait_not_empty();
    q->wait_push();
    ReleaseQLock();
}

static VOID PIN_FAST_ANALYSIS_CALL RecordMemPrefetch(VOID * ea, UINT32 size
#ifdef MULTITHREADED
    , THREADID tid
#endif
)
{
    //TODO: use a per thread lockless queue
    GetQLock(tid+1);
    q->gethead().getMa().MemAccessSet(ACCPREFETCH, ea, size
#ifdef MULTITHREADED
        , tid
#endif
        );
    iq->wait_not_empty();
    q->wait_push();
    ReleaseQLock();
}

static VOID PIN_FAST_ANALYSIS_CALL RecordMemPreWrite(VOID * ea, UINT32 size
#ifdef MULTITHREADED
    , THREADID tid
#endif
)
{
#ifdef USE_DATA
#ifdef MULTITHREADED
    MemAccess *ma = static_cast<MemAccess *>(PIN_GetThreadData(wMemAccess, tid));
    ma->MemAccessSet(ACCWRITE, ea, size, tid);
#else
    q->gethead().getMa().MemAccessSet(ACCWRITE, ea, size);
#endif
#else
    GetQLock(tid+1);
    q->gethead().getMa().MemAccessSet(ACCWRITE, ea, size
#ifdef MULTITHREADED
        , tid
#endif
        );
    iq->wait_not_empty();
    q->wait_push();
    ReleaseQLock();
#endif
}

#ifdef USE_DATA
static VOID PIN_FAST_ANALYSIS_CALL RecordMemWrite(
#ifdef MULTITHREADED
    THREADID tid
#endif
)
{
#ifdef MULTITHREADED
    ...
#else
    q->gethead().getMa().setWdata();
#endif
    //TODO: use a per thread lockless queue
    GetQLock(tid+1);
#ifdef MULTITHREADED
    q->gethead().getMa().MemAccessSet(*ma);
#endif
    iq->wait_not_empty();
    q->wait_push();
    ReleaseQLock();
}
#endif

#ifdef USE_STATES
static enum state { FASTFORWARD = 0, WARMING = 1, SIMULATION = 2} state = SIMULATION;
static UINT32 fIndex = 0;
static UINT32 wIndex = 0;
static UINT32 sIndex = 0;
...

inline VOID nextState() {
    do {
        switch (state) {
            case FASTFORWARD:
                inscount = Knobw.Value(wIndex);
                wIndex++;
                state = WARMING;
                break;
            case WARMING:
                inscount = Knobs.Value(sIndex);
                sIndex++;
                state = SIMULATION;
                break;
            case SIMULATION:
                //If we are done simulating stop
                if (Knobf.NumberOfValues() == fIndex)
                    PIN_ExitApplication(0);
                inscount = Knobf.Value(fIndex);
                fIndex++;
                state = FASTFORWARD;
                break;
            default:
                fputs("Unknown state\n", stderr);
                PIN_ExitApplication(1);
                break;
        }
    } while (inscount == 0);
}

...
#endif
...
{
    inscount--;
    dprintf("ins %p!\n", ip);
    if (inscount == 0)
        dputs("switch!\n");
    return inscount == 0;
}

...
#endif
...
    return state != FASTFORWARD;
}

static VOID PIN_FAST_ANALYSIS_CALL CounterDone(
#ifdef MULTITHREADED
    THREADID tid
#endif
)
{
    enum state orig = state;
    if (orig == SIMULATION) {
        q->send_control(SERVER_SIM_END);
    }
    nextState();
    if (state == SIMULATION)
        q->send_control(SERVER_SIM_START);
    dprintf("state:%d->%d\n", orig, state);
    //We only want to change the instrumentation when switching from any state to the fast forward state or vice versa
    // if ((orig == FASTFORWARD && state != FASTFORWARD)||(orig != FASTFORWARD && state == FASTFORWARD))
    //TODO: this isn't working as expected :( (yet)
    // PIN_RemoveInstrumentation(); //reinstrument the program
}
#endif

//Instrumentation
static VOID Instruction(INS ins, VOID *v)
{
    //Also using the IF-then callback system makes PIN more likely to inline the counter code
#ifdef USE_STATES
    INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)StateCounter, IARG_FAST_ANALYSIS_CALL, IARG_INST_PTR,
#ifdef MULTITHREADED
                     IARG_THREAD_ID,
#endif
                     IARG_END);
    INS_InsertThenCall(ins, IPOINT_BEFORE, (AFUNPTR)CounterDone, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                       IARG_THREAD_ID,
#endif
                       IARG_END);
#endif
    //if (state != FASTFORWARD) {
#ifdef USE_STATES
        INS_InsertIfCall(ins, IPOINT_BEFORE, (AFUNPTR)Instrument, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                         IARG_THREAD_ID,
#endif
                         IARG_END);
        INS_InsertThenCall
#else
        INS_InsertCall
#endif
            (ins, IPOINT_BEFORE, (AFUNPTR)RecordMemExec, IARG_FAST_ANALYSIS_CALL,
             IARG_UINT32, INS_Size(ins),
#ifdef MULTITHREADED
             IARG_THREAD_ID,
#endif
             IARG_END);

        if (INS_IsMemoryRead(ins))
        {
#ifdef USE_STATES
            INS_InsertIfPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)Instrument, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                                       IARG_THREAD_ID,
#endif
                                       IARG_END);
            INS_InsertThenPredicatedCall
#else
            INS_InsertPredicatedCall
#endif
                (ins, IPOINT_BEFORE, (AFUNPTR)(INS_IsPrefetch(ins)?RecordMemPrefetch:RecordMemRead), IARG_FAST_ANALYSIS_CALL,
                 IARG_MEMORYREAD_EA,
                 IARG_MEMORYREAD_SIZE,
#ifdef MULTITHREADED
                 IARG_THREAD_ID,
#endif
                 IARG_END);
        }

        if (INS_HasMemoryRead2(ins))
        {
#ifdef USE_STATES
            INS_InsertIfPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)Instrument, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                                       IARG_THREAD_ID,
#endif
                                       IARG_END);
            INS_InsertThenPredicatedCall
#else
            INS_InsertPredicatedCall
#endif
                (ins, IPOINT_BEFORE, (AFUNPTR)(INS_IsPrefetch(ins)?RecordMemPrefetch:RecordMemRead), IARG_FAST_ANALYSIS_CALL,
                 IARG_MEMORYREAD2_EA,
                 IARG_MEMORYREAD_SIZE,
#ifdef MULTITHREADED
                 IARG_THREAD_ID,
#endif
                 IARG_END);
        }

        // instruments stores using a predicated call, i.e.
        // the call happens iff the store will be actually executed
        if (INS_IsMemoryWrite(ins))
        {
#ifdef USE_STATES
            INS_InsertIfPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)Instrument, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                                       IARG_THREAD_ID,
#endif
                                       IARG_END);
            INS_InsertThenPredicatedCall
#else
            INS_InsertPredicatedCall
#endif
                (ins, IPOINT_BEFORE, (AFUNPTR)RecordMemPreWrite, IARG_FAST_ANALYSIS_CALL,
                 IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
#ifdef MULTITHREADED
                 IARG_THREAD_ID,
#endif
                 IARG_END);
#ifdef USE_DATA
#ifdef USE_STATES
            INS_InsertIfPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)Instrument, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                                       IARG_THREAD_ID,
#endif
                                       IARG_END);
            INS_InsertThenPredicatedCall
#else
            INS_InsertPredicatedCall
#endif
                (ins, (!INS_HasFallThrough(ins)?IPOINT_TAKEN_BRANCH:IPOINT_AFTER), (AFUNPTR)RecordMemWrite, IARG_FAST_ANALYSIS_CALL,
#ifdef MULTITHREADED
                 IARG_THREAD_ID,
#endif
                 IARG_END);
#endif
        }
    //}
}

// Multithread stuff:
#ifdef MULTITHREADED
...
{
    MemAccess *ma = new MemAccess();
    PIN_SetThreadData(wMemAccess, ma, tid);
}

static VOID ThreadFini(THREADID tid, const CONTEXT *ctxt, INT32 code, VOID *v)
{
    MemAccess *ma = static_cast<MemAccess *>(PIN_GetThreadData(wMemAccess, tid));
    delete ma;
}
#endif

//TODO: maybe integrate this into the queue class and the socket per thread protocol
static bool ending = false;
static THREADID processor;
static PIN_THREAD_UID processoruid;

static VOID ProcessQueue(VOID *nothing) {
    THREADID tid = PIN_ThreadId();
    while (!ending || !q->empty()) {
        GetLock(&c_lock, tid);
        while (!q->empty()) {
            q->gettail();
            q->pop();
        }
        ReleaseLock(&c_lock);
        //Let others fill the queue
        YIELD();
    }
}

static VOID Fini(INT32 code, VOID *v)
{
    ending = true;
    PIN_WaitForThreadTermination(processoruid, PIN_INFINITE_TIMEOUT, NULL);
    if (KnobSyscallMap || KnobCtxChangeMap)
        fclose(mout);
    server_fini2(q);
    server_fini(iq);
}

int main(int argc, char *argv[])
{
    if( PIN_Init(argc, argv) )
    {
        return Usage();
    }

#ifdef USE_STATES
    if (!(Knobf.NumberOfValues() == Knobw.NumberOfValues() && Knobf.NumberOfValues()==Knobs.NumberOfValues()))
    {
        fputs("The number of occurrences of -f -h and -s must be the same.", stderr);
        return Usage();
    }
#endif

    iq = server_init();
    q = server_init2();
    q->gethead().setType(STARTTH); //TODO move to the thread start callbacks
    q->wait_push();

#ifdef USE_STATES
    if (Knobf.NumberOfValues() >= 1)
        nextState();
    //This one is done due to the way instrumentation works
    inscount++;
    //Send the simu start command if necessary
    if (state == SIMULATION)
#endif
        q->send_control(SERVER_SIM_START);

    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniUnlockedFunction(Fini, 0);

    //Open the output file
    if (KnobSyscallMap || KnobCtxChangeMap)
        mout = fopen("maptrace.txt", "w");

    //... syscalls and so for mapping changes
    if (KnobSyscallMap)
        PIN_AddSyscallExitFunction(parsemaps1, NULL);
    //Monitor also after context changes since if we are ptraced mappings may have changed
    if (KnobCtxChangeMap)
        PIN_AddContextChangeFunction(parsemaps2, NULL);
    //Although pin uses codecaches it hides this details from the instrumentation code so our instructions caches don
    //This means the instruction addresses we get are mapped to the mappings corresponding to the libraries and not t
    //code so we don't have to worry about changes to these mappings, but, since we still can't discern them from app
    //mappings we still have to reserve space for them in the simulator space. This also means we'll be having some n
    //in the map space almost always until PIN provides an api to discern pin/tool mappings from application ones.

    //Thread Callbacks
    InitLock(&c_lock);
#ifdef MULTITHREADED
    InitLock(&h_lock);
    wMemAccess = PIN_CreateThreadDataKey(0);
    PIN_AddThreadStartFunction(ThreadStart, 0);
    PIN_AddThreadFiniFunction(ThreadFini, 0);
#endif

    //Start queue processor thread
    processor = PIN_SpawnInternalThread(ProcessQueue, NULL, 0, &processoruid);
    if (processor == INVALID_THREADID) return 1;
    //Initial map loading
    if (KnobSyscallMap || KnobCtxChangeMap)
        parsemaps3();
    PIN_StartProgram();

    return 0;
}
```
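
The phase cycling done by `nextState()` and `CounterDone()` in the listing above can be illustrated without Pin. The following is only a minimal sketch under stated assumptions, not the tool's code: `PhaseSchedule`, `PhaseCycler`, `tick()` and the extra `Done` phase are hypothetical names, the three vectors stand in for the `Knobf`, `Knobw` and `Knobs` value lists, and the constructor advances to the first phase immediately (the listing does this from `main`).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Phase { FastForward, Warming, Simulation, Done };

// Hypothetical stand-in for the three knob lists: one instruction budget
// per round for each phase.
struct PhaseSchedule {
    std::vector<uint64_t> ff, warm, sim;
};

class PhaseCycler {
  public:
    explicit PhaseCycler(const PhaseSchedule &s) : sched(s) {
        // Same precondition the listing checks before starting.
        assert(s.ff.size() == s.warm.size() && s.ff.size() == s.sim.size());
        next();
    }

    // Move to the next phase with a non-zero budget, mirroring the
    // do/while loop of nextState().
    void next() {
        do {
            switch (phase) {
              case Phase::FastForward:
                budget = sched.warm[wi++];
                phase = Phase::Warming;
                break;
              case Phase::Warming:
                budget = sched.sim[si++];
                phase = Phase::Simulation;
                break;
              case Phase::Simulation:
                if (fi == sched.ff.size()) {  // nothing left to run
                    phase = Phase::Done;
                    budget = 0;
                    return;
                }
                budget = sched.ff[fi++];
                phase = Phase::FastForward;
                break;
              case Phase::Done:
                return;
            }
        } while (budget == 0);
    }

    // Account for one executed instruction; true when the phase changed.
    bool tick() {
        if (phase == Phase::Done) return false;
        if (--budget != 0) return false;
        next();
        return true;
    }

    Phase current() const { return phase; }
    uint64_t remaining() const { return budget; }

  private:
    PhaseSchedule sched;
    std::size_t fi = 0, wi = 0, si = 0;
    uint64_t budget = 0;
    Phase phase = Phase::Simulation;  // the listing also starts in SIMULATION
};
```

A phase whose budget is zero is skipped, which is why a round can go straight from simulation to warming when its fast-forward count is 0.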

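The listing of `parsemaps()` keeps a previous snapshot (`prev`) and a current one (`s`) and notifies removals before additions, but the lines that fill `rem` and `add` did not survive. One way this could be done, shown here purely as an assumption, is with two `std::set_difference` calls. `Range`, `MapDiff` and `diffMappings` are hypothetical names; `Range` only mimics the `b` and `e` fields used by the listing's `range` type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <set>

// Hypothetical stand-in for the listing's range type: a mapping from b to e.
struct Range {
    uint64_t b, e;
    bool operator<(const Range &o) const {
        return b != o.b ? b < o.b : e < o.e;
    }
};

struct MapDiff {
    std::set<Range> removed;  // present before, gone now
    std::set<Range> added;    // new in the current snapshot
};

// Compare the previous snapshot of the mappings with the current one.
MapDiff diffMappings(const std::set<Range> &prev, const std::set<Range> &cur)
{
    MapDiff d;
    std::set_difference(prev.begin(), prev.end(), cur.begin(), cur.end(),
                        std::inserter(d.removed, d.removed.end()));
    std::set_difference(cur.begin(), cur.end(), prev.begin(), prev.end(),
                        std::inserter(d.added, d.added.end()));
    return d;
}
```

With this approach a mapping that grows or shrinks appears once in `removed` with its old bounds and once in `added` with its new bounds.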
```
#include <pinatrace.h>
#include <cstdint>
#include <cinttypes>

int main() {
    iq = client_init();
    SimDataq *q;
    q = client_init2();
    uint64_t nins = 0;
    uint64_t nins2 = 0;
    uint64_t sins
    ...
                    npre++;
                    npre2++;
                    spre += ma.getSize();
                    spre2 += ma.getSize();
                    break;
                default:
                    puts("Unexpected access type!");
            }
            q->pop();
        }
        //Have we just emptied the buffer or has an event happened?
        while (!iq->empty()) {
            if (iq->gettail().getType() == REMOVEMAPPING) {
                range r = iq->gettail().getRange();
                printf("- %lx-%lx\n", r.b, r.e);
            } else if (iq->gettail().getType() == ADDMAPPING) {
                range r = iq->gettail().getRange();
                printf("+ %lx-%lx\n", r.b, r.e);
            }
            iq->pop();
        }
        ...
                simulating = false;
                puts("Simulation statistics:");
                puts("Number of accesses:");
                printf("  instructions: %" PRIu64 "\n", nins);
                printf("  reads       : %" PRIu64 "\n", nrea);
                printf("  writes      : %" PRIu64 "\n", nwri);
                printf("  prefetches  : %" PRIu64 "\n", npre);
                printf("  total       : %" PRIu64 "\n", nins+nrea+nwri+npre);
                puts("Total accessed memory by type (bytes):");
                printf("  instructions: %" PRIu64 "\n", sins);
                printf("  reads       : %" PRIu64 "\n", srea);
                printf("  writes      : %" PRIu64 "\n", swri);
                printf("  prefetches  : %" PRIu64 "\n", spre);
                printf("  total       : %" PRIu64 "\n", sins+srea+swri+spre);
                printf("Execution mark: %" PRIx64 "\n", mark);
                q->ack_control();
                break;
            case SERVER_SIM_START:
                nins = 0;
                nrea = 0;
                nwri = 0;
                npre = 0;
                sins = 0;
                srea = 0;
                swri = 0;
                spre = 0;
                mark = 0;
                simulating = true;
                q->ack_control();
                break;
        ...
        //Wait for buffer to refill
        while (q->wait_empty_cond() && iq->wait_empty_cond()) YIELD();
    ...
    puts("Total statistics:");
    puts("Number of accesses:");
    printf("  instructions: %" PRIu64 "\n", nins2);
    printf("  reads       : %" PRIu64 "\n", nrea2);
    printf("  writes      : %" PRIu64 "\n", nwri2);
    printf("  prefetches  : %" PRIu64 "\n", npre2);
    printf("  total       : %" PRIu64 "\n", nins2+nrea2+nwri2+npre2);
    puts("Total accessed memory by type (bytes):");
    printf("  instructions: %" PRIu64 "\n", sins2);
    printf("  reads       : %" PRIu64 "\n", srea2);
    printf("  writes      : %" PRIu64 "\n", swri2);
    printf("  prefetches  : %" PRIu64 "\n", spre2);
    printf("  total       : %" PRIu64 "\n", sins2+srea2+swri2+spre2);
    printf("Execution mark: %" PRIu64 "\n", mark2);
    client_fini(q);
    return 0;
}
```
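
Both programs above use the queues in the same way: the producer fills the head slot in place and then publishes it (`gethead()` followed by `wait_push()`), while the consumer reads the tail slot in place and then releases it (`gettail()` followed by `pop()`). The queue implementation itself is not part of this section, so the following is only a sketch of how a lockless single-producer, single-consumer ring with that interface might look. `SpscQueue`, `try_push()` and `try_pop()` are hypothetical names; the non-blocking `try_` variants replace the waiting calls of the listing, and placing such a ring in shared memory would additionally require the atomics to be lock-free on the target platform.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer ring. N must be a power of two; one slot
// is always kept free so the producer can fill the head slot in place.
template <typename T, std::size_t N>
class SpscQueue {
    static_assert(N >= 2 && (N & (N - 1)) == 0, "N must be a power of two");

  public:
    bool empty() const {
        return head.load(std::memory_order_acquire) ==
               tail.load(std::memory_order_acquire);
    }

    // Producer side: write into the head slot, then publish it.
    T &gethead() {
        return slots[head.load(std::memory_order_relaxed) & (N - 1)];
    }
    bool try_push() {
        std::size_t h = head.load(std::memory_order_relaxed);
        // Publishing must still leave one free slot for the next gethead().
        if (h - tail.load(std::memory_order_acquire) == N - 1) return false;
        head.store(h + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: read the tail slot, then hand it back to the producer.
    const T &gettail() const {
        return slots[tail.load(std::memory_order_relaxed) & (N - 1)];
    }
    bool try_pop() {
        std::size_t t = tail.load(std::memory_order_relaxed);
        if (t == head.load(std::memory_order_acquire)) return false;  // empty
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

  private:
    T slots[N];
    std::atomic<std::size_t> head{0};  // next slot the producer writes
    std::atomic<std::size_t> tail{0};  // next slot the consumer reads
};
```

When `try_push()` fails, the value already written through `gethead()` stays in the unpublished head slot, so the producer can simply retry later, which is the role a waiting push would play.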

# mem\_trace\_reader.hh

```
/*
 * Copyright (c) 2004-2005 The Regents of The University of Michigan
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met: redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer;
 * redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution;
 * neither the name of the copyright holders nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Authors: Erik Hallnor
 */

/**
 * Definitions for a pure virtual interface to a memory trace reader.
 */

#ifndef __MEM_TRACE_READER_HH__
#define __MEM_TRACE_READER_HH__

#include "mem/packet.hh"
#include "mem/request.hh"
#include "params/MemTraceReader.hh"
#include "sim/sim_object.hh"

/**
 * This class contains the info of the trace request and some useful methods to
 * split it
 */
class MemTraceRequest : public FastAlloc {
    Addr _paddr;
    unsigned _size;
    Request::Flags _flags;
    Tick _time;
    int _asid;
    Addr _vaddr;
    int _contextId;
    int _threadId;
    Addr _pc;
    MemCmd _cmd;

  public:
    ...
                    MemCmd::Command cmd)
        : _paddr(paddr), _size(size), _flags(flags), _time(time), _cmd(cmd)
    { }

    ~MemTraceRequest() {} // for FastAlloc

    /**
     * Are we scheduled to run already
     */
    inline bool mustRun() {
        return _time <= curTick();
    }

    inline Tick time() {
        return _time;
    }

    inline bool isInstFetch() {
        return _flags.isSet(Request::INST_FETCH);
    }

    inline bool lastPacketSent() {
        return _size == 0;
    }

                \ast Get the next packet with proper bounds for this block size \ast Will return NULL when done
  93
 94
               PacketPtr getNextPkt (int bsize, Packet::NodeID dest, MasterID mid) {
    if (lastPacketSent()) {
        return NULL;
  95
 96
 97
98
                      }
                      }
//Base address of the block
Addr base = (_paddr & ~(bsize - 1));
//Current block massize
int msize = bsize - (_paddr - base);
99
100
101
102
                     //Minimum

if (msize > _size) msize = _size;
//Generate tthe request and the packet
RequestPtr req = new Request(_paddr, msize, _flags, mid);
PacketPtr pkt = new Packet(req,_cmd,dest);
pkt->dataDynamicArray(new char[msize]);
//Calculate the new base address and size
_paddr += msize;
_size -= msize;
return pkt;
103
104
105
106
107
108
109
110
111
112
113
               }
        };
114
\frac{115}{116}
        typedef MemTraceRequest * MemTraceRequestPtr;
117
118
119
120
          * Pure virtual base class for memory trace readers.
121
         class MemTraceReader : public SimObject
123
124
                enum reason {EOT,STAT_RESET,STAT_DUMP};
125
126
               127
128
         //TODO: redo doc pkt should contain time, request, command and data.
129
              /**

* Read the next request from the trace. Returns the request in the

* provided RequestPtr and the cycle of the request in the return value.

* @param req Return the next request from the trace.

* @return The cycle of the request, 0 if none in trace.
130
131
132
133
134
135
136
               137
138
139
        #endif //__MEM_TRACE_READER_HH__
```
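
The splitting performed by `getNextPkt` can be illustrated outside gem5. The following is a minimal standalone sketch, assuming a power-of-two block size; `splitByBlock` and its pair-based return type are hypothetical names introduced here for illustration and are not part of the project or of gem5.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not part of gem5 or of this project): splits the byte
// range [paddr, paddr + size) into pieces that never cross a block boundary,
// following the same arithmetic as MemTraceRequest::getNextPkt.
std::vector<std::pair<std::uint64_t, unsigned>>
splitByBlock(std::uint64_t paddr, unsigned size, unsigned bsize)
{
    // The mask below is only valid when bsize is a power of two.
    assert(bsize != 0 && (bsize & (bsize - 1)) == 0);

    std::vector<std::pair<std::uint64_t, unsigned>> pieces;
    while (size != 0) {
        // Base address of the block that contains paddr
        std::uint64_t base = paddr & ~static_cast<std::uint64_t>(bsize - 1);
        // Bytes left between paddr and the end of that block
        unsigned msize = bsize - static_cast<unsigned>(paddr - base);
        // Never emit more than what remains of the request
        if (msize > size) msize = size;
        pieces.emplace_back(paddr, msize);
        // Advance to the next piece
        paddr += msize;
        size -= msize;
    }
    return pieces;
}
```

For a 10-byte access at address 60 with 64-byte blocks, this sketch yields two pieces: 4 bytes at address 60 and 6 bytes at address 64.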

# pin\_reader.hh

```
/*
 * Copyright (c) 2004-2005 The Regents of The University of Michigan
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met: redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer;
 * redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution;
 * neither the name of the copyright holders nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Authors: Erik Hallnor
 */

/**
 * Definition of a memory trace reader for an M5 memory trace.
 */

#ifndef __PIN_READER_HH__
#define __PIN_READER_HH__

#include "cpu/trace/reader/mem_trace_reader.hh"
#include "cpu/trace/reader/pin_atrace.hh"
#include "params/PinReader.hh"

/**
 * A memory trace reader for a pin memory trace.
 */
class PinReader : public MemTraceReader
{
    friend class DeleteQueuesCallback;

    /** The trace. */
    SimDataq *q;
    /** Information on mapping changes */
    InstEventq *iq;
    bool simulating; // Whether we are in simulation state or not
    bool drop;       // Should we drop the next element (has it been processed)

    void removeQueues();

  public:
    PinReader(const PinReaderParams *p);

    ~PinReader();

    //TODO: redo doc pkt should contain time, request, command and data.
    /**
     * Read the next request from the trace. Returns the request in the
     * provided RequestPtr and the cycle of the request in the return value.
     * @param req Return the next request from the trace.
     * @return The cycle of the request, 0 if none in trace.
     */
    virtual MemTraceRequestPtr getNextRequest(MemTraceReader::reason &reason);
};

#endif // __PIN_READER_HH__
```

# pin\_reader.cc

```
/*
 * Copyright (c) 2004-2005 The Regents of The University of Michigan
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met: redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer;
 * redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution;
 * neither the name of the copyright holders nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Authors: Erik Hallnor
 */

/**
 * @file
 * Declaration of a memory trace reader for a pin memory trace.
 */

#include "base/callback.hh"
#include "cpu/trace/reader/pin_reader.hh"
#include "sim/sim_exit.hh"

#include <set>

//TODO: look why the user interrupt received event doesn't call the Callback
/** Callback to clean the queues */
class DeleteQueuesCallback : public Callback {
  public:
    DeleteQueuesCallback();
    void process();
};

static DeleteQueuesCallback dqc;

/** List of PinReader elements for the queue deleting callback */
static std::set<PinReader *> readers;

DeleteQueuesCallback::DeleteQueuesCallback() {
    registerExitCallback(this);
}

void DeleteQueuesCallback::process() {
    for (std::set<PinReader *>::iterator it = readers.begin(); it != readers.end(); it++) {
        (*it)->removeQueues();
    }
}

//TODO: Send client finalization events if necessary
void PinReader::removeQueues() {
    if (q) {
        client_fini2(q);
        q = NULL;
    }
    if (iq) {
        client_fini(iq);
        iq = NULL;
    }
    warn("Done");
}

PinReader::PinReader(const PinReaderParams *p) : MemTraceReader(p), simulating(false) {
    iq = client_init();
    q = client_init2();
    //Wait for the initial event
    while (q->empty()) { YIELD(); }
    drop = true;
    readers.insert(this);
}

PinReader::~PinReader() {
    removeQueues();
    readers.erase(this);
}

MemTraceRequestPtr PinReader::getNextRequest(MemTraceReader::reason &reason)
{
    MemCmd::Command cmd;
    MemTraceRequestPtr req;
    Request::Flags flags;
    if (drop) {
        assert(!q->empty());
        q->pop(); // Drop previous data
    }
    // ...
                //The last dump should be made
                reason = MemTraceReader::EOT;
                drop = false;
                return NULL;
            case SERVER_SIM_END:
                simulating = false;
                q->ack_control();
                reason = MemTraceReader::STAT_DUMP;
                drop = false;
                return NULL;
            case SERVER_SIM_START:
                simulating = true;
                q->ack_control();
                reason = MemTraceReader::STAT_RESET;
                drop = false;
                return NULL;
            case NONE:
                // ...
            default:
                warn("State not supported!");
            // ...
                        break;
                    case ACCREAD:
                        cmd = MemCmd::ReadReq;
                        break;
                    case ACCWRITE:
                        cmd = MemCmd::WriteReq;
                        break;
                    case ACCPREFETCH:
                        flags.set(Request::PREFETCH);
                        cmd = MemCmd::ReadReq;
                        break;
                    default:
                        panic("Access type unknown");
                }
                Addr ea = (Addr)ma.getEA();
                ea &= (Addr)134217727; // 128 MiB - 1
                //By default time is set to 0
                req = new MemTraceRequest((Addr)ea, (int)ma.getSize(), flags, cmd);
                drop = true;
                return req;
            // ...
            case STARTTH:
            // ...
            case INVALDATA:
            default:
                panic("Unexpected data type");
    // ...
    while (!iq->empty()) {
    // ...
```
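
The mask applied to the effective address in `getNextRequest` keeps only the low 27 bits, so every traced address is folded into a 128 MiB window. A minimal sketch of that arithmetic follows; `kWindowBytes`, `kWindowMask` and `foldAddress` are hypothetical names chosen for illustration and do not appear in the project sources.

```cpp
#include <cstdint>

// Hypothetical names, used only to illustrate the mask in getNextRequest.
constexpr std::uint64_t kWindowBytes = std::uint64_t{1} << 27; // 128 MiB
constexpr std::uint64_t kWindowMask  = kWindowBytes - 1;

// The literal used in the listing is exactly this mask.
static_assert(kWindowMask == 134217727ULL, "mask must equal 128 MiB - 1");

// Keeps the low 27 bits of an effective address, discarding the rest.
constexpr std::uint64_t foldAddress(std::uint64_t ea)
{
    return ea & kWindowMask;
}
```

Two addresses that differ only above bit 26 map to the same folded address, so this scheme presumably relies on the traced working set fitting in the window.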

# trace\_cpu.hh

```
/*
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met: redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer;
 * redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution;
 * neither the name of the copyright holders nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Authors: Erik Hallnor
 */

/**
 * Declaration of a memory trace CPU object. Uses a memory trace to drive the
 * provided memory hierarchy.
 */

#ifndef __CPU_TRACE_TRACE_CPU_HH__
#define __CPU_TRACE_TRACE_CPU_HH__

#include <string>

#include "mem/mem_object.hh"
#include "mem/packet.hh" // for RequestPtr
#include "mem/port.hh"
#include "params/TraceCPU.hh"
#include "sim/eventq.hh" // for Event
#include "sim/sim_object.hh"

// Forward declaration
class MemTraceReader;

enum CMD { Invalid, Read, Write, Writeback };

/**
 * A cpu object for running memory traces through a memory hierarchy.
 */
class TraceCPU : public MemObject
{
  private:
    class MemPort : public Port
    {
        TraceCPU *tcpu;
        PacketPtr retryPkt;
        bool accessRetry;

      public:
        MemPort(const std::string &_name, TraceCPU *_tcpu)
            : Port(_name, _tcpu), tcpu(_tcpu) { accessRetry = false; }

        bool locked() {
            return accessRetry;
        }

        void sendPkt(PacketPtr pkt);

      protected:
        virtual bool recvTiming(PacketPtr pkt);
        virtual Tick recvAtomic(PacketPtr pkt);
        virtual void recvFunctional(PacketPtr pkt);
        virtual void recvRangeChange();
        virtual void recvRetry();
    };

    /** Port for instruction trace requests, if any. */
    MasterID _instMasterId;
    MemPort icache;

    /** Port for data trace requests, if any. */
    MasterID _dataMasterId;
    MemPort dcache;

    /** Data reference trace. */
    MemTraceReader *dataTrace;

    /** Number of outstanding requests. */
    int outstandingRequests;

    /** Next packet containing data, time, request, command, etc */
    MemTraceRequestPtr nextRequest;

    /** Reason for the packet to be NULL */
    MemTraceReader::reason reason;

    /** Next request. */
    MemCmd::Command nextCmd;

    // ...
    class TickEvent : public Event
    {
      private:
        TraceCPU *cpu;

      public:
        TickEvent(TraceCPU *c) : Event(CPU_Tick_Pri), cpu(c) {}
        void process() { cpu->tick(); }
        virtual const char *description() const { return "TraceCPU tick"; }
    };

    TickEvent tickEvent;
    inline Tick ticks(int numCycles) const { return numCycles; }

  public:
    /**
     * Construct a TraceCPU object.
     */
    TraceCPU(const TraceCPUParams *p);

    inline Tick ticks(int numCycles) { return numCycles; }

    // ...
    void tick();

    /**
     * Handle a completed memory request.
     */
    void completeRequest(PacketPtr req);

    virtual Port *getPort(const std::string &if_name, int idx = -1);
};

#endif // __CPU_TRACE_TRACE_CPU_HH__
```

#### trace\_cpu.cc

```
/*
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are
 * met: redistributions of source code must retain the above copyright
 * notice, this list of conditions and the following disclaimer;
 * redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution;
 * neither the name of the copyright holders nor the names of its
 * contributors may be used to endorse or promote products derived from
 * this software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
 * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
 * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
 * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
 * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
 * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
 * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
 * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
 * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
 * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *
 * Authors: Erik Hallnor
 */

/**
 * Declaration of a memory trace CPU object. Uses a memory trace to drive the
 * provided memory hierarchy.
 */

#include <algorithm> // For min

#include "cpu/trace/reader/mem_trace_reader.hh"
#include "cpu/trace/trace_cpu.hh"
//#include "mem/base_mem.hh" // For PARAM constructor
//#include "mem/mem_interface.hh"
//#include "params/TraceCPU.hh"
#include "base/statistics.hh"
#include "mem/packet.hh"
#include "sim/eventq.hh"
#include "sim/sim_events.hh"
#include "sim/sim_exit.hh"
#include "sim/system.hh"

using namespace std;

TraceCPU::TraceCPU(const TraceCPUParams *p)
    : MemObject(p),
      _instMasterId(p->sys->getMasterId(name() + ".inst")), icache("instructions", this),
      _dataMasterId(p->sys->getMasterId(name() + ".data")), dcache("data", this),
      dataTrace(p->trace), outstandingRequests(0), tickEvent(this)
{
    nextRequest = dataTrace->getNextRequest(reason);
    schedule(&tickEvent, curTick() + ticks(1));
}

//TODO: fix unaligned accesses out of block boundaries
void
TraceCPU::tick()
{
    assert(outstandingRequests >= 0);
    assert(outstandingRequests < 1000);
    int instReqs = 0; //TODO convert to stats
    int dataReqs = 0; //TODO convert to stats

    while (!nextRequest) {
        if (outstandingRequests) return;
        switch (reason) {
          case MemTraceReader::EOT:
            // No more requests to send. Finish trailing events and exit.
            //TODO: fix this
            if (queue()->empty()) {
                exitSimLoop("end of memory trace reached");
            } else {
                if (!tickEvent.scheduled())
                    schedule(&tickEvent, queue()->nextTick() + ticks(1));
            }
            return;
          case MemTraceReader::STAT_RESET:
            nextRequest = dataTrace->getNextRequest(reason);
            Stats::reset();
            break;
          case MemTraceReader::STAT_DUMP:
            nextRequest = dataTrace->getNextRequest(reason);
            Stats::dump();
            break;
        }
    }
    if (nextRequest->mustRun()) {
        int bsize = 0;
        if (nextRequest->isInstFetch()) {
            bsize = icache.peerBlockSize();
        } else {
            bsize = dcache.peerBlockSize();
        }
        //Rest of the request: get the new address and the new size
        if (nextRequest->isInstFetch()) {
            PacketPtr nextPkt = nextRequest->getNextPkt(bsize, 0, _instMasterId);
            // assert(nextPkt->req->thread_num < 4 && "Not enough threads");
            nextPkt->setSrc(0);
            ++instReqs;
            // DPRINTF("id %d initiating %sread at addr %x (blk %x) expecting %x\n",
            //         id, do_functional ? "functional " : "", req->getPaddr(),
            //         blockAddr(req->getPaddr()), *result);
            icache.sendPkt(nextPkt);
        } else {
            PacketPtr nextPkt = nextRequest->getNextPkt(bsize, 0, _dataMasterId);
            // assert(nextPkt->req->thread_num < 4 && "Not enough threads");
            nextPkt->setSrc(0);
            ++dataReqs;
            // ...
            dcache.sendPkt(nextPkt);
        }
        //If we are done with the current packet we go for the next.
        if (nextRequest->lastPacketSent()) {
            delete nextRequest;
            nextRequest = dataTrace->getNextRequest(reason);
        }
    } else if (!tickEvent.scheduled())
        schedule(&tickEvent, max(curTick() + ticks(1), (nextRequest ? nextRequest->time() : 0)));
}

Port *
TraceCPU::getPort(const std::string &if_name, int idx)
{
    if (if_name == "data")
        return &dcache;
    else if (if_name == "instructions")
        return &icache;
    else
        panic("No Such Port\n");
}

bool
TraceCPU::MemPort::recvTiming(PacketPtr pkt)
{
    if (pkt->isResponse()) {
        tcpu->completeRequest(pkt);
    } else {
        // must be snoop upcall
        assert(pkt->isRequest());
        assert(pkt->getDest() == Packet::Broadcast);
    }
    return true;
}

Tick
TraceCPU::MemPort::recvAtomic(PacketPtr pkt)
{
    panic("Atomic accesses not supported");
    // must be snoop upcall
    assert(pkt->isRequest());
    assert(pkt->getDest() == Packet::Broadcast);
    return curTick();
}

void
TraceCPU::MemPort::recvFunctional(PacketPtr pkt)
{
    //Do nothing if we see one come through
}

void
TraceCPU::MemPort::recvRangeChange()
{
}

void
TraceCPU::MemPort::recvRetry()
{
    if (sendTiming(retryPkt)) {
        DPRINTF(MemTest, "accessRetry setting to false\n");
        accessRetry = false;
        retryPkt = NULL;
    }
}

void
TraceCPU::MemPort::sendPkt(PacketPtr pkt) {
    if (atomic) {
        cachePort.sendAtomic(pkt);
        completeRequest(pkt);
    // ...
        accessRetry = true;
        retryPkt = pkt;
    }
}

// TODO: handle stats
// void
// TraceCPU::regStats()
// {
//     using namespace Stats;
//
//     numReadsStat
//         .name(name() + ".num_reads")
//         .desc("number of read accesses completed")
//         ;
//
//     numWritesStat
//         .name(name() + ".num_writes")
//         .desc("number of write accesses completed")
//         ;
//
//     numExecsStat
//         .name(name() + ".num_exec")
//         .desc("number of execution accesses completed")
//         ;
// }

void
TraceCPU::completeRequest(PacketPtr pkt)
{
    Request *req = pkt->req;
    // ...
    //Remove the address from the list of outstanding
    // ...
    } else {
        //TODO: handle stats
        if (pkt->isRead()) {
            numReads++;
            numReadsStat++;
        } else {
            assert(pkt->isWrite());
            numWrites++;
            numWritesStat++;
        }
    }
    pkt->deleteData();
    delete pkt->req;
    delete pkt;
    if (!tickEvent.scheduled())
        schedule(&tickEvent, max(curTick() + ticks(1), (nextRequest ? nextRequest->time() : 0)));
}

TraceCPU *
TraceCPUParams::create()
{
    return new TraceCPU(this);
}
         /* To convert*/
275
276
              void\\ MemTest::completeRequest(PacketPtr\ pkt)
277
278
279
                      Request * req = pkt -> req;
280
281
282
                      if (issueDmas)  {
283
                              dmaOutstanding = false;
284
285
                      287
289
290
                      MemTestSenderState * state =
291
292
                      dynamic\_cast < MemTestSenderState \ *>(pkt->senderState);
293
                      \begin{array}{lll} uints\_t &* data &= state -\!\!> \!\! data; \\ uints\_t &* pkt\_data &= pkt -\!\!> \!\! getPtr <\!\!uints\_t >\!\!(); \end{array}
294
295
296
                      //Remove the address from the list of outstanding
297
                      The move the data est from the tist of outstands std::set < unsigned > ::iterator\ remove Addr = outstanding Addrs. find (req->get Paddr()); assert (remove Addr! = outstanding Addrs.end()); outstanding Addrs.erase (remove Addr);
298
299
300
301
302
                       \begin{array}{ll} if \;\; (pkt->isError\,()) \;\; \{ \\ if \;\; (!suppress\_func\_warnings) \;\; \{ \\ warn\,("Functional \;\; Access \;\; failed \;\; for \;\; %x \;\; at \;\; %x\backslash n \;", \\ pkt->isWrite\,() \;\; ? \;\; "write " \;\; : \;\; "read " \;\; , \;\; req->getPaddr\,()); \\ \vdots \\ \end{array} 
303
304
305
306
307
                      308
309
                                      \begin{array}{lll} pkt->isRead()) & \{ & if \; (memcmp(pkt\_data \;,\; data \;,\; pkt->getSize()) \; != \; 0) \; \{ & \\ & panic("%s: \; read \; of \; \%x \; (blk \; \%x) \; @ \; cycle \; \%d \; " \\ & "returns \; \%x \;,\; expected \; \%x \backslash n \; " \;,\; name() \;, \\ & req->getPaddr() \;,\; blockAddr(req->getPaddr()) \;,\; curTick() \;,\; *pkt\_data \;,\; *data) \;; \\ \end{array} 
310
311
312
313
314
315
                                    }
\frac{316}{317}
                                    numReads++;
318
                                     numReadsStat++;
                                      \begin{array}{lll} if & (numReads == (uint64\_t) nextProgressMessage) \ \{ & ccprintf(cerr, "%s: completed \%d read, \%d write accesses @\%d \ n", \\ & name(), numReads, numWrites, curTick()); \\ & nextProgressMessage \ += progressInterval; \\ \end{array} 
320
322
323
324
                             325
326
327
328
                                     assert(pkt->isWrite());
                                     funcPort.writeBlob(req->getPaddr(), pkt\_data, req->getSize());
330
```

```
numWritesStat++;
333
                                                                                           }
                                                                      }
335
                                                                        noResponseCycles = 0;
                                                                      337
339
340
341
343
                            // void
// MemTest::tick()
// {
345
347
                                                                     349
351
352
353
354
355
356
                                                                                                        unsigned\ dma\_access\_size = random()\ \%\ 4;\ */
357
                                                                       unsigned\ cmd=0; offset++;
358
359
                                                                      unsigned base = 0;
uint64_t data = random();
unsigned access_size = 0;
bool uncacheable = false;
360
361
362
363
364
365
                                                                        unsigned dma_access_size = random() % 4;
366
367
                                                                        //If we aren't doing copies, use id as offset, and do a false sharing
                                                                     //If we aren't doing copies, use in as office, and then use the id
//we can eliminate the lower bits of the offset, and then use the id
//to offset within the blks
// offset = blockAddr(offset);
// offset += id;
// access_size = 0;
// dma_access_size = 0;
368
370
372
373
374
                                                                       \begin{array}{lll} Request \ *req = new \ Request(); \\ Request:: Flags \ flags; \end{array}
376
                                                                       Addr paddr;
378
                                                                        \begin{array}{ll} if & (uncacheable) & \{\\ & flags.set \, (Request:: \textit{UNCACHEABLE}) \,;\\ & paddr = uncacheAddr \, + \, offset \,; \end{array} 
380
381
382
                                                                        \begin{array}{lll} pattar & -a & a & b & b \\ pattar & -a & b \\ pa
384
385
                                                                        bool\ do\_functional = false;
386
387
                                                                      \begin{array}{l} if \ (issueDmas) \ \{ \\ paddr \ \mathcal{B} = \sim ((1 << dma\_access\_size) - 1); \\ req -> setPhys (paddr, 1 << dma\_access\_size, flags); \\ req -> setThreadContext (id, 0); \\ \end{array}
388
389
390
391
392
    } else {
        paddr &= ...
    }
    assert(req->getSize() == 1);

    uint8_t *result = new uint8_t[8];

    if (cmd < percentReads) {
        // read

        // For now we only allow one outstanding request per address
        // per tester. This means we assume CPU does write forwarding
        // to reads that alias something in the cpu store buffer.
        if (outstandingAddrs.find(paddr) != outstandingAddrs.end()) {
            delete [] result;
            delete req;
            return;
        }

        outstandingAddrs.insert(paddr);
```

```
        // ***** NOTE FOR RON: I'm not sure how to access checkMem. - Kevin
        funcPort.readBlob(req->getPaddr(), result, req->getSize());

        ccprintf(cerr,
                 "id %d initiating %sread at addr %x (blk %x) expecting %x\n",
                 id, do_functional ? "functional " : "", req->getPaddr(),
                 blockAddr(req->getPaddr()), *result);

        PacketPtr pkt = new Packet(req, MemCmd::ReadReq, Packet::Broadcast);
        pkt->setSrc(0);
        pkt->dataDynamicArray(new uint8_t[req->getSize()]);
        MemTestSenderState *state = new MemTestSenderState(result);
        pkt->senderState = state;

        if (do_functional) {
            assert(pkt->needsResponse());
            pkt->setSuppressFuncError();
            cachePort.sendFunctional(pkt);
            completeRequest(pkt);
        } else {
            sendPkt(pkt);
        }
    } else {
        // write

        // For now we only allow one outstanding request per address
        // per tester. This means we assume CPU does write forwarding
        // to reads that alias something in the cpu store buffer.
        if (outstandingAddrs.find(paddr) != outstandingAddrs.end()) {
            delete [] result;
            delete req;
            return;
        }

        outstandingAddrs.insert(paddr);

        DPRINTF(MemTest, "initiating %swrite at addr %x (blk %x) value %x\n",
                do_functional ? "functional " : "", req->getPaddr(),
                blockAddr(req->getPaddr()), data & 0xff);

        PacketPtr pkt = new Packet(req, MemCmd::WriteReq, Packet::Broadcast);
        pkt->setSrc(0);
        uint8_t *pkt_data = new uint8_t[req->getSize()];
        pkt->dataDynamicArray(pkt_data);
        memcpy(pkt_data, &data, req->getSize());
        MemTestSenderState *state = new MemTestSenderState(result);
        pkt->senderState = state;

        if (do_functional) {
            pkt->setSuppressFuncError();
            cachePort.sendFunctional(pkt);
            completeRequest(pkt);
        } else {
            sendPkt(pkt);
        }
    }
```
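
Both the read and the write path in the listing above share the same guard: a request is dropped if its physical address is already in `outstandingAddrs`, and the address is recorded otherwise. The following is a minimal Python sketch of that policy only. The names `OutstandingTracker`, `try_issue` and `complete` are hypothetical and are not part of gem5; the sketch also assumes that an address is released once its response has been handled, which happens outside the part of the listing shown here.

```python
class OutstandingTracker:
    """Tracks addresses with an in-flight request (hypothetical helper, not gem5 API)."""

    def __init__(self):
        # Plays the role of outstandingAddrs in the listing.
        self._outstanding = set()

    def try_issue(self, paddr):
        """Record paddr and return True, or return False if it is already in flight."""
        if paddr in self._outstanding:
            # Corresponds to the early return that discards the new request.
            return False
        self._outstanding.add(paddr)
        return True

    def complete(self, paddr):
        """Release paddr (assumption: done when the response is handled)."""
        self._outstanding.remove(paddr)


tracker = OutstandingTracker()
first = tracker.try_issue(0x1000)   # accepted, address becomes outstanding
second = tracker.try_issue(0x1000)  # rejected, a request is still in flight
tracker.complete(0x1000)
third = tracker.try_issue(0x1000)   # accepted again after completion
```

Keeping at most one request per address means the tester never has to reason about the ordering of two aliasing accesses, at the cost of skipping some randomly generated requests.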

# pintrace.py

```
# Copyright (c) 2006-2007 The Regents of The University of Michigan
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met: redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer;
# redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution;
# neither the name of the copyright holders nor the names of its
# contributors may be used to endorse or promote products derived from
# this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
# OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# Authors: Ron Dreslinski

import optparse
import sys

import m5
from m5.objects import *

parser = optparse.OptionParser()

#parser.add_option("-m", "--maxtick", type="int", default=m5.MaxTick,
#                  metavar="T",
#                  help="Stop after T ticks")

(options, args) = parser.parse_args()

if args:
    print "Error: script doesn't take any positional arguments"
    sys.exit(1)

# define prototype L1 cache
proto_l1 = BaseCache(size = '32kB', assoc = 4, block_size = 128,
                     latency = '1ns', tgts_per_mshr = 1)
proto_l1.mshrs = 1

pr = PinReader()

tcpu = TraceCPU(trace = pr)

# system simulated
system = System(physmem = PhysicalMemory(latency = "100ns"))

new_bus = Bus(clock="500MHz", width=16)
system.physmem.cpu_side_bus = new_bus
system.physmem.port = new_bus.master

data_l1 = BaseCache(size = '32kB', assoc = 4, block_size = 64,
                    latency = '1ns', tgts_per_mshr = 8)
data_l1.mshrs = 1

ins_l1 = BaseCache(size = '32kB', assoc = 4, block_size = 64,
                   latency = '1ns', tgts_per_mshr = 8)
ins_l1.mshrs = 1

new_bus.cache = [ data_l1, ins_l1 ]

new_bus.slave = data_l1.mem_side
new_bus.slave = ins_l1.mem_side

data_l1.cpu = tcpu
tcpu.data = data_l1.cpu_side
```

```
tcpu.instructions = ins_l1.cpu_side

#def make_level(spec, prototypes, attach_obj, attach_port):
#    fanout = spec[0]
#    parent = attach_obj # use attach obj as config parent too
#    if len(spec) > 1 and (fanout > 1 or options.force_bus):
#        new_bus = Bus(clock='500MHz', width=16)
#        new_bus.port = getattr(attach_obj, attach_port)
#        parent.cpu_side_bus = new_bus
#        attach_obj = new_bus
#        attach_port = "port"
#    objs = [prototypes[0]() for i in xrange(fanout)]
#    if len(spec) > 1:
#        # we just built caches, more levels to go
#        parent.cache = objs
#        for cache in objs:
#            cache.mem_side = getattr(attach_obj, attach_port)
#            make_level(spec[1:], prototypes[1:], cache, "cpu_side")
#    else:
#        # we just built the MemTest objects
#        parent.cpu = objs
#        for t in objs:
#            t.test = getattr(attach_obj, attach_port)
#            t.functional = system.funcmem.port

#make_level(treespec, prototypes, system.physmem, "port")

root = Root( full_system = False, system = system )
root.system.mem_mode = 'timing'

root.system.system_port = root.system.physmem.port

# Not much point in this being higher than the L1 latency
m5.ticks.setGlobalFrequency('1ns')

# instantiate configuration
m5.instantiate()

# simulate until program terminates
exit_event = m5.simulate(m5.MaxTick)
```

# Bibliography

- [1] Kenneth Barr. Dinerotool. Oct. 2005. URL: http://kbarr.net.
- [2] Nathan Binkert et al. "The gem5 simulator". In: SIGARCH Comput. Archit. News 39.2 (Aug. 2011), pp. 1–7. ISSN: 0163-5964. DOI: 10.1145/2024716.2024718. URL: http://doi.acm.org/10.1145/2024716.2024718.
- [3] Zhongliang Chen et al. *The Multi2Sim Simulation Framework*. URL: http://www.multi2sim.org/files/multi2sim-r277.pdf.
- [4] Circular buffer. Nov. 2012. URL: http://en.wikipedia.org/w/index.php?title=Circular\_buffer&oldid=522370238#Always\_Keep\_One\_Slot\_Open.
- [5] H. J. Curnow and B. A. Wichmann. "A synthetic benchmark". In: The Computer Journal 19.1 (1976), pp. 43–49. DOI: 10.1093/comjnl/19.1.43. eprint: http://comjnl.oxfordjournals.org/content/19/1/43.full.pdf+html. URL: http://comjnl.oxfordjournals.org/content/19/1/43.abstract.
- [6] Susan L. Graham, Peter B. Kessler, and Marshall K. Mckusick. "Gprof: A call graph execution profiler". In: SIGPLAN Not. 17.6 (June 1982), pp. 120–126. ISSN: 0362-1340. DOI: 10.1145/872726.806987. URL: http://doi.acm.org/10.1145/872726.806987.

- [7] Mark Hill and Jan Edler. Dinero IV Trace-Driven Uniprocessor Cache Simulator. Feb. 1998. URL: http://www.cs.wisc.edu/~markhill/DineroIV/.
- [8] Chi-Keung Luk et al. "Pin: building customized program analysis tools with dynamic instrumentation". In: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation. PLDI '05. Chicago, IL, USA: ACM, 2005, pp. 190–200. ISBN: 1-59593-056-6. DOI: 10.1145/1065010.1065034. URL: http://doi.acm.org/10.1145/1065010.1065034.
- [9] J.E. Miller et al. "Graphite: A distributed parallel simulator for multicores". In: *High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on*. Jan. 2010, pp. 1–12. DOI: 10.1109/HPCA.2010.5416635. URL: http://groups.csail.mit.edu/carbon/docs/graphite\_hpca2010\_preprint.pdf.
- [10] Vijay Janapa Reddi et al. "PIN: a binary instrumentation tool for computer architecture research and education". In: *Proceedings of the 2004 workshop on Computer architecture education: held in conjunction with the 31st International Symposium on Computer Architecture*. WCAE '04. Munich, Germany: ACM, 2004. DOI: 10.1145/1275571.1275600. URL: http://doi.acm.org/10.1145/1275571.1275600.
- [11] Cloyce D. Spradling. "SPEC CPU2006 Benchmark Tools". In: SIGARCH Computer Architecture News 35 (1 Mar. 2007).
- [12] Richard M. Stallman. GDB manual: the GNU source-level debugger. 2nd, GDB version 2.5. Free Software Foundation, Inc. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA, Tel: (617) 876-3296, Feb. 1988, pp. ii + 63.

- [13] Richard M. Stallman. Using and Porting GNU CC. Tech. rep. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA, Tel: (617) 876-3296: Free Software Foundation, Inc., 1988.
- [14] The gcc website. URL: http://gcc.gnu.org/.
- [15] The gdb website. URL: http://www.gnu.org/software/gdb/.
- [16] The Gem5 website. URL: http://www.gem5.org/.
- [17] The gprof website. URL: http://sourceware.org/binutils/docs/gprof/.
- [18] The Graphite website. URL: http://groups.csail.mit.edu/carbon/?page\_id=111.
- [19] The modified SPLASH-2 website. URL: www.capsl.udel.edu/splash/.
- [20] The Multi2Sim website. URL: http://www.multi2sim.org/.
- [21] The Pin website. URL: http://software.intel.com/en-us/articles/pintool/.
- [22] The SPEC CPU2006 website. URL: http://www.spec.org/cpu2006/.
- [23] The SPLASH-2 website. URL: http://web.archive.org/web/http://www-flash.stanford.edu/apps/SPLASH/.
- [24] vanDooren. Creating a thread safe producer consumer queue in C++ without using locks. Jan. 2007. URL: http://msmvps.com/blogs/vandooren/archive/2007/01/05/creating-a-thread-safe-producer-consumer-queue-in-c-without-using-locks.aspx.
- [25] Reinhold P. Weicker. "Dhrystone: a synthetic systems programming benchmark". In: Commun. ACM 27.10 (Oct. 1984), pp. 1013-1030. ISSN: 0001-0782. DOI: 10.1145/358274.358283. URL: http://doi.acm.org/10.1145/358274.358283.

- [26] S.C. Woo et al. "The SPLASH-2 Programs: Characterization and Methodological Considerations". In: *Proc. of the 22nd International Symposium on Computer Architecture*. June 1995.