Concurrent Objects

Method calls take time. Method calls by concurrent threads may overlap, whereas method calls by a single thread are always sequential.

Compositional

A correctness property P is compositional if, whenever each object in the system satisfies P, the system as a whole satisfies P. Basically we can compose objects satisfying a property P to create a more complex system that also satisfies P. For example, quiescent consistency is compositional.


Quiescent Consistency

Principles:

  1. Method calls should appear in one-at-a-time sequential order. (trivial)
  2. Method calls separated by a period of quiescence should appear to take effect in real-time order.


PAPI native hardware counters

Check for the available native hardware counters:

$ papi_native_avail

Sample code to read the values of native hardware counters:

#include <stdio.h>
#include <papi.h>

#define MAX_COUNTERS 256
int rozmer = 1;

static char * events[] = {
    "PERF_COUNT_HW_CACHE_L1D:WRITE",
    "PERF_COUNT_HW_CACHE_L1D:ACCESS",
    "PERF_COUNT_HW_CACHE_L1D:READ",
    "PERF_COUNT_HW_CACHE_L1D:PREFETCH",
    "PERF_COUNT_HW_CACHE_L1D:MISS",
    "HW_PRE_REQ:L1D_MISS",
    //"L1-DCACHE-LOADS"
};

#define NUMCOUNTERS 6

/* Minimal sketch of a measurement routine; error handling is kept short
 * and the event names above are platform-specific. */
void compute() {
    int i, EventCode;
    int EventSet = PAPI_NULL;
    long long PAPI_Counters[MAX_COUNTERS];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return;
    if (PAPI_create_eventset(&EventSet) != PAPI_OK)
        return;
    for (i = 0; i < NUMCOUNTERS; i++) {
        /* translate each event name to a code and add it to the event set */
        if (PAPI_event_name_to_code(events[i], &EventCode) == PAPI_OK)
            PAPI_add_event(EventSet, EventCode);
    }
    PAPI_start(EventSet);
    /* ... workload to be measured goes here ... */
    PAPI_stop(EventSet, PAPI_Counters);
    for (i = 0; i < NUMCOUNTERS; i++)
        printf("%s = %lld\n", events[i], PAPI_Counters[i]);
}


Fast Paxos

  • The algorithm executes multiple rounds.
  • An acceptor votes to accept at most one value in a single round.
  • Goal: achieve consistency by ensuring that different values are not chosen in different rounds.
  • Note that rounds can run concurrently and may be skipped altogether.

Acceptor

State maintained by acceptor a (sketched in C below):

# state maintained by acceptor
rnd[a]
vrnd[a]
vval[a]
  • rnd[a] - the highest-numbered round in which a has participated, initially 0.
  • vrnd[a] - the highest-numbered round in which a has cast a vote, initially 0.
  • vval[a] - the value a voted to accept in round vrnd[a]; its initial value is irrelevant and can be ignored while vrnd[a] == 0.
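A minimal C sketch of this per-acceptor state (the struct name and value type are assumptions for illustration):

typedef struct {
    int       rnd;   /* highest round a has participated in, initially 0       */
    int       vrnd;  /* highest round in which a has cast a vote, initially 0  */
    long long vval;  /* value voted for in round vrnd; ignored while vrnd == 0 */
} acceptor_state;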


Ruby Notes

Lambdas vs Procs

Both are Proc objects

l = lambda {|name| puts "hello #{name}"}
p = proc   {|name| puts "hello #{name}"}

l # => #<Proc:0x007ff16389da58@-:1 (lambda)>
p # => #<Proc:0x007ff16389da30@-:2>

Return behaves differently

return inside a lambda returns only from the lambda itself; return inside a proc returns from the enclosing method.


Spanner: Google's Globally Distributed Database

Spanner: a database that shards data across many sets of Paxos state machines in datacenters spread across the world.

Time API

The clock uncertainty shows a sawtooth pattern while executing: it grows between synchronizations and comes back down when synced from a time server.

Smax: the largest timestamp up to which a replica is up to date.

Leases

Leaders participate in local and distributed transactions, so it's better to keep the same leader for a long time. Leases are auto-released after expiration.

Extending the lease:
  • the leader can assign a timestamp to a lease in the future.
  • the leader can try to extend its lease to make sure it's present when a transaction happens.


map reduce

map: processes a key/value pair to generate a set of intermediate key/value pairs.

reduce: merges all intermediate values associated with the same intermediate key.

map    (k1, v1)       -> list(k2, v2)
reduce (k2, list(v2)) -> list(v2)
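For instance, a word-count round in the same notation (inputs made up for illustration):

map    ("doc1", "a b a")  -> [("a", 1), ("b", 1), ("a", 1)]
reduce ("a", [1, 1])      -> [2]
reduce ("b", [1])         -> [1]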

All the mappers have to finish before the reducers can complete.


Paxos Notes

Source: Paxos Made Practical - David Mazieres

Paxos is a simple protocol that a group of machines in a distributed system can use to agree on a value proposed by a member of the group.

If it terminates, the protocol reaches consensus even if the network was unreliable and multiple machines simultaneously tried to propose different values.

The basic idea is that each proposal has a unique number.


Make notes

Makefile basic structure

target: source
  command

.PHONY marks clean as a phony target, so make runs the rm command even when a file named clean exists and looks up to date; without it, creating a file called clean would silently prevent the rule from running.

.PHONY: clean
clean:
  rm *.o

Get the dependencies source for a given file.
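For example, gcc can print a make-style dependency rule for a source file (the file names here are illustrative):

$ gcc -MM foo.c
foo.o: foo.c foo.h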


concurrency

  • Determinism doesn't effectively solve the problem.
  • Symbolic execution

Transactions and Concurrency

ACID

  • Atomicity: all or nothing. Keep a log so you can clean up if the system fails. Output commit -> you are screwed.
  • Consistency: internal consistency of the database.
  • Isolation: executes as if it were running alone.
  • Durability: results will not be lost.

Memory transactions: no durability guarantee - what if the power goes down?


static and dynamic linking

// add.h
int add(int, int);

// sub.h
int sub(int, int);

Create the corresponding C files.

// add.c
int add(int i, int j) {
  return i + j;
}

// sub.c
int sub(int i, int j) {
  return i - j;
}
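A sketch of building and linking these both ways; the library name libmath and a main.c that calls add and sub are assumptions for illustration:

# static linking: archive the objects, then copy them into the binary
gcc -c add.c sub.c
ar rcs libmath.a add.o sub.o
gcc main.c libmath.a -o main_static

# dynamic linking: build a shared object, resolved at load time
gcc -shared -fPIC -o libmath.so add.c sub.c
gcc main.c -L. -lmath -o main_dynamic   # needs libmath.so on the loader path at run time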

Read more »

process address space

  • On IA-32 systems with 2^32 = 4 GiB of address space, the address space is usually split in a 3:1 ratio. The kernel is assigned 1 GiB, while 3 GiB is available to each userspace process.
  • The contents of the kernel address space are always the same, regardless of which user process is active.

Layout of the Process Address Space

  1. text segment: binary code of the code currently running.
  2. code for dynamic libraries.
  3. heap where dynamically generated data are stored.
  4. stack used to store local variables and to implement function calls.
  5. Sections with environment variables and command-line arguments.
  6. Memory mappings that map the contents of files into the virtual address space.

Each process has an mm_struct instance that can be accessed via its task structure (task_struct->mm).
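The layout of a live process can be inspected through procfs; each line of the output is one mapping (text, heap, libraries, stack, etc.):

$ cat /proc/self/maps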

program loading and memory mapping

arch_prctl(ARCH_SET_FS, addr)

From man page:

set architecture specific thread state

  • ARCH_SET_FS: set the 64-bit base address for the FS register to addr

    Earlier kernels stored an array indexed by thread ID containing the address of each thread's storage area; now the kernel sets the FS register for thread-local storage instead. GS is used for the kernel's per-CPU storage pointer, whereas FS is used for userspace thread-local storage.
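A small sketch of reading the FS base through the raw syscall (glibc historically doesn't wrap arch_prctl, so this goes via syscall(2); x86-64 Linux assumed):

#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <asm/prctl.h>   /* ARCH_GET_FS / ARCH_SET_FS */

int main(void) {
    unsigned long fsbase = 0;
    /* ask the kernel for this thread's 64-bit FS base address */
    if (syscall(SYS_arch_prctl, ARCH_GET_FS, &fsbase) != 0) {
        perror("arch_prctl");
        return 1;
    }
    printf("FS base: %#lx\n", fsbase);
    return 0;
}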


java notes

  • Good time library: JodaTime
  • Logging: Log4j

Device Drivers

Device driver used to interact with device. Major number is used to identify device driver.

  • Character device: writes and reads character by character. Operates in a blocking way (synchronous; the user must wait for completion). More common.
  • Block device: writes/reads block by block (larger units).

Google File System

Source: http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
Published: 2003


Design Constraints

  1. Component failures are the norm rather than the exception. The FS consists of thousands of commodity parts and is accessed by a comparable number of clients.
  2. Files are huge by traditional standards. Multi-GB files are common; managing billions of approximately KB-sized files is unwieldy.
  3. Most files are mutated by appending new data rather than overwriting existing data.
  4. Co-designing the applications and the file system API benefits the overall system by increasing flexibility.

Architecture

GFS consists of a single master and multiple chunkservers and is accessed by multiple clients.

Class Notes:

Problem in datacenters: failures.

Master: Single point of failure, master has log of operations.

Lease: soft state.

Revoke:

  1. wait
  2. contact server

Snapshot:

  1. revoke lease
  2. copy-on-write

Snapshot on large files -> copy-on-write.

Contact replicas and tell them to make local copies. Update the metadata to point to the individual chunks.

  • consistency handling is distributed to the application libraries.
  • unique identifier could be a problem.

Chunk metadata at the master is not persisted; the master polls chunkservers for chunk locations at startup.

Appends are ordered:

1) lease and primary 2) primary serial #'s

The master doesn't know about the namespaces either; they are also stored on chunkservers.

  • No hardlinks/softlinks
  • absolute paths
  • no datastructure representing directories like inodes
  • no datastructure to enumerate contents of one directory

Memory Resource Management in VMware ESX Server

Source: https://www.usenix.org/legacy/event/osdi02/tech/waldspurger/waldspurger.pdf
Published: 2002

VMware ESX Server is a thin layer designed to multiplex hardware resources efficiently among virtual machines running unmodified commodity operating systems.

Memory Virtualization

All guest operating systems expect a zero-based physical address space, as provided by real hardware. ESX Server gives this illusion by adding an extra level of translation: virtual addresses are mapped to "physical addresses" (provided by ESX Server), which are further mapped to machine addresses (actual memory on the chip). Separate shadow page tables, which contain virtual-to-machine page mappings, are maintained for use by the processor.

 -----------------      PMAP in ESX    ------------------     Page Table in OS   -----------------
| Machine Address |   <-------------  | Physical Address |  <-----------------  | Virtual Address |
 -----------------                     ------------------                        -----------------

Reclamation Mechanisms

ESX Server usually overcommits memory and needs a reclamation mechanism to reclaim memory from one or more virtual machines.

Ballooning

A small balloon module is loaded into the guest OS as a pseudo-device driver or kernel service. It doesn't have any external interface. When the ESX server wants to reclaim memory, it instructs the driver to 'inflate' by allocating pinned physical pages within the VM. This creates memory pressure inside the VM, and the guest OS starts reclaiming memory to satisfy the driver's request. The balloon driver communicates the physical page numbers to the ESX server, which may then reclaim the corresponding machine pages.

  • Technically the guest OS shouldn't touch pages allocated to the balloon driver; however, the ESX server doesn't rely on guest OS correctness. It annotates its pmap entry and deallocates the associated MPN (machine page number). Any subsequent attempt to access that memory will generate a page fault that is handled by the server.

Demand Paging

ESX server preferentially uses ballooning to reclaim memory, treating it as a common-case optimization. If ballooning is not possible or insufficient, memory is reclaimed by paging out to an ESX Server swap area on disk, without any guest involvement.

A randomized page replacement policy is used so that higher-level page semantics are not affected.

Sharing Memory - Content-Based page sharing

ESX server maintains a hash table of pages, keyed by a hash summarizing each page's content. If the hash value for a new page matches an existing page, the full contents of the pages are compared; if they match, a reference count is incremented and the page is shared. Any attempt to write to a shared page will generate a fault, transparently creating a private copy of the page (COW - copy-on-write).

If no match is found, the page is hashed but not marked COW. Instead it is tagged as a special hint entry. On any future match, the contents of the hint page are rehashed to check whether the page has been modified. If the hash is still valid, a full comparison is performed and the page is shared (marked COW now).

Share based allocation

Resource rights are encapsulated by shares, which are owned by clients that consume resources. Traditionally, VMs that have more shares are allocated more resources, and in case of reclamation the VMs with fewer shares are used to reclaim memory. It may happen that VMs with larger shares are not using their memory while VMs with fewer shares are the ones that could actually benefit from it, so resource utilization suffers.

ESX server introduces the concept of an idle memory tax, which charges more for idle pages than for pages that are actively used.

Measuring Idle memory

ESX server uses a statistical sampling approach to obtain aggregate VM working-set estimates directly, without any guest involvement. At the start of a sampling period, a small number of the VM's "physical" pages are selected and their corresponding TLB entries and MMU state are invalidated. When the VM accesses those pages, it faults and the use of the page is registered. At the end of the period, the fraction f of memory actively accessed by the VM is simply f = t/n (t page faults out of n invalidated pages).

writing my own networked file system

https://github.com/goyalankit/os/blob/master/lab2/netfs.c

log structured file system

In LFS, the log is the only structure on disk. The log is divided into segments (to maintain space for large files) where data is written sequentially. A segment cleaner compresses the information from heavily fragmented segments.

  • Log also stores indexing information so that the files can be read back with efficiency comparable to current file systems.

LFS is based on the assumption that files are cached in main memory, so the majority of disk traffic is write traffic, which is efficient since we are writing sequentially and eliminating seeks on the disk.

  • Sprite LFS uses 60-75% of the disk bandwidth, whereas the Unix file system uses about 5-10%, since most of the time is spent seeking.

The main challenge of LFS is to ensure that there are always large extents of free space available for writing new data. Large extents called segments are used, where a segment cleaner continually regenerates empty segments by compacting the live data from heavily fragmented segments.

File system is governed by two basic building blocks: technology and workload.

Three components that are significant for file system design:

  1. Processors - getting faster and faster, putting more pressure on memory.
  2. Main memory - can cache files and change the predominant type of traffic (e.g., more writes than reads), so you can optimize your FS for writes.
  3. Disks - well, that's where you store your data; they have two components:
    • Transfer Bandwith - can be improved using disk arrays and parallel-head disks.
    • Access Time - depends on mechanical head (at the time of publishing).

  • The Unix file system writes metadata structures such as directories and inodes synchronously, which is bad for workloads with lots of small files.
  • LFS buffers a sequence of file system changes in the file cache and writes them sequentially in a single disk operation.

Key Issues

  1. Information retrieval - uses inodes and maintains an inode map which contains the locations of inodes.
  2. Large free space

xen - the art of virtualization

x86 virtualization challenges

  • The EFLAGS register has an interrupt enable flag (IF). A kernel that is being virtualized doesn't have the privilege to enable interrupts, yet kernels call pushf and popf all over the place.
  • Register cr3 points to the base of the page table. The hardware MMU reads cr3 on page faults when virtual addressing is enabled. The VMM can't trust the OS; this is solved either by shadow page tables (ESX Server) or by allowing updates after verification (Xen).
  • An untagged TLB means frequent flushes. Addressed with hardware translation.

Consider page fault

ESX Server

Guest application: refers to an unmapped memory address.
Hardware: TRAP! Set privileged bit; jump to VMM trap handler.
VMM: Figure out which guest OS; clear privileged bit; check the shadow page table (or software TLB cache); if the page is not resident, set up trap registers for the guest OS; jump to that guest OS's trap handler.
Guest OS: Read cr2 to find the faulting address.
HW: TRAP! You can't read cr2; set privileged bit; jump to VMM trap handler.
VMM: Read cr2 and put the faulting address into a guest OS register; clear privileged bit.
Guest OS: Allocate a physical page and write the page table entry for VA -> PA.
HW: TRAP! You can't write (READ-ONLY access) to the page table; set privileged bit; jump to VMM handler.
VMM: Allocate a page of machine memory, record the PA -> MA mapping and install it in the shadow page table; clear privileged bit; return to the guest OS trap handler.
Guest OS: Everything worked fine; `iret` to clear the privileged bit and return to the application.
HW: TRAP! (You don't get to clear the privileged bit); set privileged bit; jump to VMM handler.
VMM: `iret` to the guest application.

XEN

For each privilege level, a separate stack is maintained.

The guest OS runs in Ring 1, applications run in Ring 3 and the VMM runs in Ring 0.

When an application is executing it has its own stack in Ring 3. On a page fault, the hardware traps and calls the trap handler. The trap handler is registered directly by the guest OS (though validated by Xen). The trap-handling code runs in Ring 1 (guest OS).

Memory:

  • Guest OSes allocate and manage hardware page tables.
  • The guest OS allocates a page table and registers it with Xen, giving away its write privileges.
  • All updates (made via hypercalls) need to be validated by Xen.
  • Xen registers page tables directly with the MMU.
  • To aid validation, frames are marked as: page directory, page table, local descriptor table, global descriptor table, or writable.

CPU:

  • Basic x86 assumption: the OS is the most privileged entity; rings help here.
  • Privileged instructions are paravirtualized by requiring them to be validated and executed within Xen. Any guest OS attempt to execute a privileged instruction directly is failed by the processor (silently or with a trap).

  • Exceptions and traps are virtualized straightforwardly. A table describing the handler for each type of exception is registered with Xen for validation. Exception stack frames are unmodified under paravirtualization.

  • Exceptions can be handled directly by hardware, but page faults need to go through Xen.

  • The guest's page fault handler would read cr2, which is not possible outside Ring 0, so Xen copies cr2 into the extended stack frame.

  • On a page fault, Xen copies the stack frame to the guest OS and returns control to the registered handler. Trap handlers (guest OS) run in Ring 1; cr2 is readable only in Ring 0. M2P (machine-to-physical mapping).

Device I/O

  • clean and simple abstractions.
  • interrupts -> lightweight event delivery mechanism.

Control and Management:

  • The hypervisor is involved in scheduling, but there is no need for it to be involved in high-level details such as how to share the CPU among domains or what kind of packets a domain can transfer.
  • The hypervisor provides control operations through an interface accessible from authorized domains.
  • Domain 0 is created at boot time, has access to the control interface, and is responsible for running application-level management software.
  • The control interface provides the ability to create/destroy domains and to set their scheduling parameters, physical memory allocations and the access they are given to physical resources like physical disks and network devices.

Class Notes:

Domain 0 - is a VM. Device drivers are put inside domain 0.

Writing a block of data ==> goes to the VMM ==> goes to the driver VM (which converts it to actual blocks).


Not Xen-related: with hardware virtualization support, system calls and page faults don't trap to the VMM.

Native:  VA - PT - PA

VMware:  VA - Shadow PT - MA
         PA - PMAP - MA

Shadow page tables are per guest application, since they map from virtual addresses. The PMAP is per guest OS.

Extended page tables - per VM, or per guest.

A VM has a well-defined PA space.

Two ways of building a VMM:

Build it into the operating system, so the VMM explicitly runs like an OS, or host it on a general-purpose OS.

ESX Server (special-purpose VMM) vs VMware Workstation (hosted on a general-purpose OS).

Double paging - two different pieces of software paging the same memory.

Fix: force the guest OS to use its own paging algorithm.

information on exceptions and interrupt handlers

http://cse.unl.edu/~goddard/Courses/CSCE351/IntelArchitecture/IntelInterupts.pdf

scheduling basics

Scheduling policy in Linux Kernel

source: Understanding the Linux Kernel, Chapter 7

CPU time is divided into slices, one for each runnable process. Time sharing relies on timer interrupts and is thus transparent to processes.

For scheduling processes are traditionally classified as: I/O-bound or CPU-bound.

Another classification distinguishes three kind of processes:

  1. Interactive processes: constantly interact with users. These programs spend a lot of time waiting for keypresses and should be woken up quickly when input is received, e.g., shells, text editors, etc.

  2. Batch Processes: Not much user interaction and often run in background. Often penalized by the scheduler. e.g., compilers, database search engines and scientific computations.

  3. Real-time process: Stringent scheduling requirements, should never be blocked due to low priority processes. e.g., video and sound applications, controllers collecting data from sensors.

Process Preemption:

Linux processes are preemptible (in both kernel and user mode). When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process.

E.g., two programs - a text editor and a compiler. The compiler is running; the user presses a key; an interrupt is raised and the kernel wakes up the editor process. The kernel also sees that the dynamic priority of the text editor is higher than the compiler's, so it sets TIF_NEED_RESCHED for the current process (thus forcing a reschedule when the interrupt handler returns). As the interrupt handler returns, the scheduler is invoked and the context switch happens. Note: the preempted process is not suspended; it remains in the TASK_RUNNING state.

Quantum size

If the quantum is too short, context switches get expensive. If it's too long, it may make the system unresponsive, e.g., two users launch processes together, one interactive and one batch; if the scheduler initially schedules the batch process first, the interactive process becomes unresponsive.

Note that too long a quantum doesn't generally degrade the response time of interactive processes, thanks to process preemption.

The choice of quantum is usually a compromise; Linux chooses a duration as long as possible while keeping good response time.

practical cgroups

To list the cgroups of a particular process (say PID = 16455):

$ cat /proc/16455/cgroup

Example output:

goyal@fyra:/sys/fs/cgroup$ cat /proc/16455/cgroup
11:name=systemd:/user/1002.user/5.session
10:hugetlb:/user/1002.user/5.session
9:perf_event:/user/1002.user/5.session
8:blkio:/user/1002.user/5.session
7:freezer:/user/1002.user/5.session
6:devices:/user/1002.user/5.session
5:memory:/user/1002.user/5.session
4:cpuacct:/user/1002.user/5.session
3:cpu:/user/1002.user/5.session
2:cpuset:/

Mount an existing cgroup hierarchy onto the file system:

mkdir /sys/fs/cgroup/cg1
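mkdir only creates the mount point; attaching a hierarchy there takes a mount of type cgroup (the subsystem list and the name cg1 are illustrative):

mount -t cgroup -o cpu,cpuacct cg1 /sys/fs/cgroup/cg1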

cgroup basics

Source: https://www.kernel.org/doc/Documentation/cgroups/cgroups.txt

CGROUPS

Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.


Definitions

cgroup

A cgroup associates a set of tasks with a set of parameters for one or more subsystems.

Term              | Example
------------------|--------------------------
Set of tasks      | process ids
Set of parameters | 1 or 2 CPUs
Subsystem         | CPUSET, MEMORY % or nodes


subsystem

A subsystem is a module that makes use of the task grouping facilities provided by cgroup to treat group of tasks in particular ways.

  • cgroup provides the grouping of tasks with a set of parameters.

A subsystem is typically a "resource controller" that schedules a resource or applies per-cgroup limits. But it may be anything that wants to act on a group of processes e.g. a virtualization subsystem.

Hierarchy

A hierarchy is a set of cgroups arranged in a tree, such that every task in the system is in exactly one of the cgroups in the hierarchy, and a set of subsystems; each sub-system has system specific state attached to each cgroup in the hierarchy. Each hierarchy has an instance of the cgroup virtual filesystem associated with it.

Rule 1. A single hierarchy can have one or more subsystems attached to it.

Example: You create a single hierarchy for memory and cpu. It's good if you have one to one mapping. You can have:

  1. /cg1: 40% memory, 60% CPU
  2. /cg2: 60% memory, 40% CPU

Rule 2: Any single subsystem (such as cpu) cannot be attached to more than one hierarchy if one of those hierarchies has a different subsystem attached to it already.

Rule 3: A task can belong to only one cgroup in a hierarchy. All tasks on a system are by default members of the default cgroup of that hierarchy, which is known as the root cgroup.

Rule 4: Any process that forks creates a child task. By default the child task inherits the cgroup membership of its parent, but it can be moved to a different cgroup as needed.

source: https://access.redhat.com/documentation/en-US/RedHatEnterpriseLinux/6/html/ResourceManagementGuide/sec-RelationshipsBetweenSubsystemsHierarchiesControlGroupsandTasks.html

Precise Miss Analysis for Program Transformation with Caches of Arbitrary Associativity

Source: http://mrmgroup.cs.princeton.edu/papers/ghosh_asplos98.pdf
Published: 1998

Currently the gap between memory and processor speed is addressed by compiler- and programmer-applied optimizations like matrix blocking, data structure padding and other program transformations.

This paper defines CMEs (a cache miss equation framework), which express memory references and cache conflict behaviour in terms of a set of equations.

Two main approaches to automatically improve data locality for loop-oriented programs:

  1. Loop nest restructuring - transformations such as permutation, tiling (partitioning the loop space into chunks) and fusion are used to reorder access patterns. These transformations have usually considered capacity misses in the past; however, there can also be conflict misses, which are difficult to identify and are highly sensitive to problem size and base addresses.
  2. Data layout optimizations - adding padding to data structures.

CMEs are a system of linear Diophantine equations

Assumptions:

  • Loops are assumed to be normalized, such that the step value is 1.
  • Consider only perfectly nested loops, plus some imperfectly nested loops if they have one block between the nests.
  • Loops contain no conditional statements.
  • All loads/stores inside the loop nest are array references.

Cache Miss Equations

Cold Cache Misses

  1. Either the first access along the direction of the reuse vector, or
  2. accesses that have just crossed a memory-line boundary along the direction of the reuse vector.

Replacement Miss Equations

  1. Example: for the access stream Ra, Rb, Ra, a conflict clearly occurs in a direct-mapped cache if Ra and Rb map to the same cache line.

This happens when: Memory_Address_of_Ra = Memory_Address_of_Rb + n * Cache_Size + Line_Size_Range

Roughly, a conflict occurs if the memory addresses accessed differ by a multiple of the cache size.

In a k-way set-associative cache, a miss occurs if, between consecutive accesses to a memory line, at least k other accesses occur to distinct memory lines that map to the same cache set.

Algorithm

  1. Find all reuse vectors along an iteration point.
  2. Order all reuse vectors in lexicographical order.
  3. Find potential conflict points: p = i - r
  4. Solve the CME equations for each reuse vector; a solution at a point corresponds to either a cold miss or a conflict miss.
  5. If the solution satisfies the cold miss equations, mark it as an "intermediate point"; otherwise it is a conflict miss.

Set-Associative Cache

Each CME solution can be summarized as a triple (i, j, n). n denotes the number of cache "wrap-arounds", and the same value of n indicates a conflict for the same memory line.

For a k-way associative cache, maintain a set of distinct conflict points with the same n; if the set size exceeds k, we have a conflict.

Solutions to CMEs

Padding

Find constraints on the CMEs, using the cache base address and dimension sizes, such that no solution exists for the CMEs. Decide padding based on those constraints.

cache basics

The performance of CPUs increases very fast; memory, however, is not evolving as fast (neither in latency nor in bus bandwidth).

So we put a cache closer to the processor. Caches are small:
  • Caches are faster because they are smaller.
  • The speed of light limits the speed of larger memories.

Locality

Programs tend to use data and instructions with addresses near or equal to those recently used.

Temporal Locality

  • Recently referenced items are likely to be referenced again in the near future.
  • Replacement policy of cache.

Spatial Locality

  • Items with nearby addresses tend to be referenced close together in time.
  • Bring blocks of memory to the cache
int sum = 0;
for (int i = 0; i < n; i++) {
  sum += a[i];
}
return sum;

In the above program:

Data:

  • Spatial Locality: array accessed with stride-1 pattern
  • Temporal Locality: sum referenced in each iteration

Instructions:

  • Temporal: cycle through loop iteratively.
  • Spatial: reference instructions in sequence.

Cache structure of core i7

Cache Line or Block Size

In a direct mapped cache, cache line is the number of bytes that you can store at a given index (can be extended to set associative)
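As a concrete sketch, with made-up geometry (64-byte lines and 64 sets, i.e., a 32 KiB 8-way cache), an address decomposes into offset, set index and tag like this:

#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64   /* bytes per cache line (assumed)        */
#define NUM_SETS  64   /* 32 KiB / (64 B * 8 ways) = 64 sets    */

int main(void) {
    uintptr_t addr   = 0xdeadbeef;
    uintptr_t offset = addr % LINE_SIZE;               /* byte within the line */
    uintptr_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* which set it maps to */
    uintptr_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* identifies the line  */
    printf("offset=%lu set=%lu tag=%#lx\n",
           (unsigned long)offset, (unsigned long)set, (unsigned long)tag);
    return 0;
}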

Cache conflicts

Reads

  1. Cold (compulsory) misses: occur on the first access to a block of data.
  2. Conflict miss: occurs when the cache is large enough, but multiple data objects all map to the same slot, e.g., referencing blocks 0, 8, 0, 8.
  3. Capacity Miss: Occurs when the set of active cache blocks is larger than the cache.

Writes

  1. Data may be stored at multiple levels of memory. (copies are spread across levels)
  2. On write-hit (write block is in cache)
    • Write-through: Write immediately to memory
    • Write-back (defer write to memory until the line is evicted): Set a dirty bit to indicate if line is different from memory or not.
  3. On write-miss:
    • write-allocate: load into cache, update the line in cache. Good if more writes to the location follow.
    • no-write-allocate: just write immediately to memory.
  4. Typical Caches:
    • write-back + write-allocate
    • write-through + no-write-allocate, occasionally in the case of multiple processors.

Threads and Input/Output in the Synthesis Kernel

Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.51.3887
Published: 1989

The Synthesis operating system kernel combines several techniques to provide high performance, including kernel code synthesis, fine-grained scheduling and optimistic synchronization.

Goals:

  1. High performance.
  2. A self-tuning capability to adapt to dynamic load and configuration changes.
  3. A simple, uniform and intuitive model of computation with a high-level interface.

Two kind of objects supported by synthesis kernel: Threads and I/O

Synthesis model of computation

To support parallel and distributed computing, the threads of execution form a directed graph, in which the nodes are threads and the arcs are data flow channels.

Synthesis threads are threads of execution. Some threads never execute user-level code but run entirely within the kernel to provide additional concurrency for some kernel operations. Threads execute programs in a quaspace (quasi address space), which also stores data. Finally, I/O devices move data between threads, including files and messages.

No virtual memory in current implementation

On one physical node, all the Synthesis quaspaces are subspaces of one single address space. The kernel blanks out the parts of the address space that each quaspace is not supposed to see. Since the quaspaces are part of the same address space, it is easy to share memory by adjusting their address mappings.

Kernel Code Synthesis

  1. Factoring invariants bypasses redundant computations.
  2. The collapsing-layers method eliminates unnecessary procedure calls and context switches, both vertically for layered modules and horizontally for pipelined threads.
  3. The executable data structure method shortens data structure traversal time when the data structure is always traversed the same way.

The Synthesis kernel can be divided into a number of collections of procedures and data. These collections are called quajects; they encapsulate hardware resources (providing an interface to the hardware). Important quajects include threads and I/O device servers.

Threads (quajects that encapsulate hardware) are abstractions of the CPU. The device servers are abstractions of I/O devices. Apart from threads, quajects consist only of procedures and data.

Quajects don't support inheritance or any other OO features; events such as interrupts start the threads that animate the quajects and do the work.

Building blocks of quajects:

  1. monitors, queues and schedulers.
  2. switches (equivalent to a C switch statement; direct interrupts to the appropriate service routines), pumps (contain a thread that actively copies its input to its output) and gauges (count events like procedure calls, data arrivals and interrupts; used by the scheduler).

pumps connect passive producers to passive consumers.

Queues

Queue types => synchronous (blocks when empty or full) vs asynchronous (signals at those conditions) => dedicated (only one producer and one consumer; omits synchronization) vs optimistic queues (accept inserts and deletes from multiple producers/consumers).

Threads are created by the Quaject Creator in 3 stages:

  1. Allocation - allocate memory for the quaject and its procedures.
  2. Factorization - uses factoring invariants to substitute constants into the quaject's code templates.
  3. Optimization - do peephole optimizations.

Quaject Interfacer

Starts the execution of existing quajects by installing them (loading their procedures and data) into the invoking thread. Four stages:

  1. Find the connecting mechanism (queue, monitor, pump or a simple procedure call) from one quaject to another.
  2. Factorization (at run time).
  3. Optimization (at run time).
  4. The dynamic link stage stores the synthesized code's entry points into the quajects.

Reducing Synchronization

Three methods to reduce synchronization:

  1. Code isolation (synchronization avoidance) -> kernel code synthesis separates independent modifications of a data structure, e.g., a thread table entry (lock the global table -> change local tables).
  2. Procedure chaining (likewise) -> serializes the execution of conflicting threads. A signal during interrupt handling could be dangerous, so chain the procedure invoked by the signal to the end of the interrupt handler (replace the return address - efficient).
  3. Optimistic synchronization -> update without synchronization and test at the end; if something bad happened, roll back and retry.

Optimistic Queues

Queues are important since a lot of functionality is implemented using queues. Four types: SP-SC, SP-MC, MP-SC, MP-MC.

  • SP-SC - synchronization is only necessary when the queue becomes empty or full; usually a busy wait. Solution: update the shared pointer (Q-head for the producer, Q-tail for the consumer) at the very end. On later architectures, you may have to insert a memory fence to ensure store order (see the sketch after this list).

  • MP-SC - compare-and-swap instruction + retry loop. Producers claim space first by incrementing the Q-head and, if successful, start inserting elements. Consumers can't trust the head now, so producers update flags corresponding to queue positions indicating whether an element is valid.

Compare-and-swap is atomic, so producers update a local pointer and check that it is valid; if so, they swap it with the head. All other producers then see the updated head while the current producer inserts elements into its claimed space.
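A minimal C11 sketch of the dedicated SP-SC case (names and sizes are made up): each side writes only its own index, and the release store publishes the data, playing the role of the memory fence mentioned above.

#include <stdatomic.h>
#include <stdbool.h>

#define QSIZE 256  /* must be a power of two */

typedef struct {
    int buf[QSIZE];
    _Atomic unsigned head;  /* written only by the producer */
    _Atomic unsigned tail;  /* written only by the consumer */
} spsc_queue;

/* Producer: place the element first, then publish it by bumping head. */
bool enqueue(spsc_queue *q, int v) {
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h - t == QSIZE) return false;                 /* full  */
    q->buf[h % QSIZE] = v;
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}

/* Consumer: read the element, then free the slot by bumping tail. */
bool dequeue(spsc_queue *q, int *v) {
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t == h) return false;                         /* empty */
    *v = q->buf[t % QSIZE];
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}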

Threads

The thread state is defined by the TTE (thread table entry), containing the register save area, interrupt handlers, error traps and signal vectors, the address map tables, and the context-switch-in and -out procedures.

The kernel generates some code for threads in a protected area - custom system calls, synthesized interrupt handlers (such as for queue buffering), specialized error trap handlers and thread calls (start, stop, etc.).

  • The thread switches to supervisor mode while running a kernel call, instead of a kernel server running it on the thread's behalf. The kernel quaspace is made accessible.

The waiting queue is spread across resources, and a waiting thread's unblocking procedure is chained to the end of the interrupt handler.

Context Switch

  1. Switch only the part of the context being used (most processes don't use the FP co-processor, so don't save that state [the default synthesized code] until it is used [then resynthesize the code]).

  2. Use executable data structures to minimize the critical path - there is no "dispatcher"; the context-switch-out code of one process contains a jmp to the context-switch-in code of the next process.

Signal

A signal is an asynchronous software interrupt sent by one thread to another. The signal system call modifies a general area so that the receiving thread can run the signal handler code in user mode.

Scheduling

Fine-grained quanta are assigned to threads based on "need to execute", determined by the rate at which I/O data flows into and out of the thread's quaspace.

Effective CPU = Quanta to thread / Quanta to all threads

Priority -> assigning larger quanta or placing the thread at the front of the ready queue.

I/O

At boot time, the kernel creates servers for the raw physical devices. A simple example pipelines the raw input from a tty to filters; multiple filters may be applied along the way.

Producer/Consumer in I/O

Active producer - passive consumer (or vice versa): in the SP-SC case, a procedure call from the consumer asking the producer for data does the job.

For MP-SC or SP-MC, attach a monitor at the "multiple" end, resulting in MP-SC or SP-MC queues.

Passive producer - passive consumer -> e.g., xclock: a clock producer ready to produce the time and a clock displayer ready to display it at all times. We use a pump here, which reads from one end and writes to the other.

Interrupt Handlers

Each thread has its own synthesized interrupt handlers and system calls. Many interrupts can be handled in the current execution thread, so only minimal saving of registers is required and a context switch is avoided. During a low-level interrupt, higher-level interrupts may happen; those interrupts are chained.

file system continued

fsync: transfers (flushes) all modified in-core data of the file to the disk drive. The call blocks until the device reports that the transfer has completed.

fdatasync: similar to fsync, but it doesn't flush modified metadata unless that metadata is needed in order for a subsequent data retrieval to be handled correctly.

source: http://www.hackinglinuxexposed.com/articles/20030616.html

flock(fd, operation): locks or unlocks an entire file. Doesn't work on NFS-mounted filesystems, unfortunately. Available operations are:

LOCK_SH Create a shared lock
LOCK_EX Create an exclusive lock
LOCK_UN Release our lock
LOCK_NB Don't block when setting the lock. Return an error if the action cannot be completed.

lockf(fd, operation, offset): locks or unlocks the portion of the file after offset. Allows you to lock just trailing portions of the file, if desired. lockf is just a front end for fcntl. Arguments:

F_LOCK  Create an exclusive lock
F_TLOCK Create an exclusive lock or return an error immediately
F_ULOCK Release our lock
F_TEST  Test if the file is locked by another process.

fcntl(fd, command, struct flock *): fcntl offers the most locking control. The flock structure details the beginning and end of the segment to lock. It has arguments analogous to those of lockf and can differentiate between read and write locks. (fcntl can do other things too, such as setting the close-on-exec flag, duplicating a file descriptor, and more.)
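A minimal sketch of whole-file locking with flock (the lock file path is made up):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>

int main(void) {
    int fd = open("/tmp/demo.lock", O_CREAT | O_RDWR, 0644);
    if (fd < 0) { perror("open"); return EXIT_FAILURE; }

    if (flock(fd, LOCK_EX) != 0) {  /* blocks until the exclusive lock is ours */
        perror("flock");
        return EXIT_FAILURE;
    }
    printf("holding the exclusive lock\n");  /* critical section */
    flock(fd, LOCK_UN);                      /* release the lock  */
    close(fd);
    return EXIT_SUCCESS;
}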

Virtual File System

file lookup


source: Professional Linux Kernel Architecture, chapter 8

Symbolic links are directory entries which don't contain any data, just a pointer to a filename. A separate inode is used for each symbolic link. The data segment of the inode contains a character string that gives the name of the link target.

With the introduction of multiple namespaces and Linux containers, file descriptors can no longer uniquely identify a file. A unique representation is provided by a special data structure (struct file).

verification by model checking

Formal verification techniques can be thought of as comprising three parts:

  1. a framework for modelling systems, typically a description language.
  2. a specification language for describing the properties to be verified.
  3. a verification method to establish whether the description of the system satisfies the specification.

Approaches to verification:

  1. Proof based: the system description is a set of formulas Γ and the specification is another formula φ. The verification method consists of trying to find a proof that Γ ⊢ φ.
  2. Model based: the system is represented by a model M, the specification is again a formula φ, and the verification method consists of computing whether the model M satisfies φ (M ⊨ φ).

Exokernel: an OS architecture for application-level resource management - Notes

source: http://research.cs.wisc.edu/areas/os/Qual/papers/exokernel.pdf
published: SIGOPS 1995

Traditionally operating systems define an interface between physical resources and applications. OS provides several abstractions like processes, files, address spaces and interprocess communication.

These hardcoded abstractions:

  • Limit domain-specific optimizations
  • Discourage changes to the implementations of current abstractions
  • Make it difficult or impossible for applications to define new abstractions

In an exokernel these problems are solved by using application-level (i.e., untrusted) resource management. In exokernels, virtual memory and interprocess communication are implemented at the application level. A minimal OS - the exokernel - basically multiplexes the available hardware resources. The application-level implementations are provided as libraries which application developers can use directly or replace with their own. The exokernel tries to implement a very low-level interface, thus leaving latitude for more domain-specific optimizations at the higher (application) level.

Rather than emulating hardware (as virtual machines do, with high overhead), an exokernel exports hardware resources. It employs three techniques to export resources securely:

  1. secure bindings: applications can securely bind to machine resources and handle events.
  2. visible revocation: applications participate in a resource revocation protocol
  3. abort protocol: an exokernel can break secure bindings of uncooperative applications by force.

use cases of exokernels:

  1. Page tables are application dependent, and with domain-specific implementations each application may implement its own page tables on an exokernel - for example, a database implementation.
  2. The same is true for the cache replacement policy; a paper shows that application runtime can be improved by 45% by implementing application-specific caching policies.
  3. The end-to-end argument applies to exokernels, and new technology can be adopted more easily by providing library operating systems (application-level resource management libraries).

Library Operating Systems

  1. Library operating systems need not multiplex a resource, so they can be easily specialized without consideration of resource management.
  2. Since lib OSes are treated as untrusted applications by the exokernel, a lib OS can trust its application (which makes it easy to design). Given invalid parameters, the exokernel will check the input (since the lib OS is untrusted) and declare it invalid; no other applications will be affected.
  3. Finally, there will be fewer kernel crossings, since most of the OS runs in the address space of the application.

Exokernel Design

The challenge of the exokernel is to provide maximum freedom to applications in managing physical resources while protecting them from each other. The exokernel separates protection from management through a low-level interface.

Exokernel Notes

Deadlock conditions:
  • mutual exclusion
  • non-preemptive (no preemption)
  • hold and wait
  • circular wait

Thrashing vs Livelock

End-to-end argument: putting reliability inside your system, when you have end-to-end reliability checking anyway, can only be viewed as an optimization.

basic os concepts

Memory Allocation/Memory Management

The act of managing computer memory. The essential requirement is to provide ways to dynamically allocate memory to programs at their request and to free it when no longer needed.

  1. Memory pools / fixed-size block allocation: uses a free list of same-size blocks of memory and allocates them on request. This works well for embedded systems.
  2. Buddy blocks: memory is allocated from several pools of blocks instead of just one, where each pool holds blocks whose size is a power of 2. When a certain amount of memory is requested, the smallest pool meeting the requirement is used to allocate memory. If no such block is present, a block from a larger pool is split into two smaller "buddies" and then memory is allocated (see the worked example below).
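For instance (sizes made up for illustration): a 96-byte request rounds up to the 128-byte pool; if that pool is empty, a 256-byte block is split into two 128-byte buddies and one of them satisfies the request. When both buddies are free again, they can be coalesced back into a single 256-byte block.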

Fragmentation

Three types: External, Internal and Data fragmentation

Since memory is often assigned in blocks whose sizes are powers of 2, there is a waste of memory, either internal or external. Over time, as memory is allocated and released, holes appear. Holes outside any allocated area are called external fragmentation, and wasted space inside an allocated area is called internal fragmentation. Data fragmentation occurs when a collection of data in memory is broken into pieces that are not close together - for example, a file that cannot fit in contiguous memory.

Translation look-aside buffer (TLB)

A cache that the Memory Management Unit uses to improve virtual address translation speed.

Virtual memory is the address space seen by a process. The page table keeps track of where the virtual pages are stored in physical memory. The TLB is implemented as a CAM (content-addressable memory): the CAM key is the virtual address and the search result is the physical address. If there's a miss in the TLB, translation proceeds by a page walk.

Interrupt

An interrupt is a signal to the processor, emitted by software or hardware, indicating an event that needs immediate attention. The processor responds by suspending its current activity, saving its state and executing a function called an interrupt handler (or interrupt service routine, ISR).

Hardware interrupts are used by devices to communicate that they require special attention. They are generated by sending alerting signals to the processor - basically each device gets a wire (an interrupt line) that it can use to signal the processor. Hardware interrupts are asynchronous.

Software interrupts are caused either by an exceptional condition in the processor itself or by a special instruction in the instruction set which causes an interrupt when executed. The former is often called a trap (divide-by-zero causes a trap). The special instruction is used to request services from low-level system software such as device drivers. Software interrupts are often used to implement system calls.

There are several ways to implement interrupts: an active wire (level-triggered, high or low), edge-triggered, message-signaled, or doorbell interrupts.

Distributed Shared Memory

Physically separate memories can be addressed globally: the same address on two processors points to the same memory location. Can be implemented at the OS level or the library level.

Stateful and stateless servers

A stateful server remembers client data from previous requests, whereas a stateless server doesn't.

State may refer to any information related to a client, like whether a file is open or being modified, cached data on the client, etc.

NFS is a stateless system.
  • Each request must be complete in itself, since no state is stored: the file has to be fully identified and any offsets specified.
  • If a server crashes and then recovers, no information is lost since it's stateless: a higher degree of fault tolerance.
  • No server overhead for storing client information, and the server isn't affected if a client crashes.

AFS is a stateful system.
  • Requests are shorter.
  • Cache coherence is possible; the server knows which clients are caching which blocks of a file.
  • File locking is possible; the server can record that a certain client is locking a file.
  • The server returns an identifier to the client, creating a session, and keeps track of open files. Maintaining state is required to provide Unix file semantics.

Difference between stateful and stateless file systems

  • Failure Recovery:
    • A stateful server loses all its volatile state in a crash. The server also needs to be aware of client failures to reclaim memory.
  • Slow processing and longer request messages in case of stateless systems. Difficult to provide unix file semantics.
  • A server employing server-initiated caching needs to keep state to validate caches.
  • UNIX's use of file descriptors and implicit offsets is inherently stateful: servers must maintain tables to map file descriptors to inodes and store the current offset within the file.

Detailed Info

Swapping

When all the processes in a multiprogrammed system are not able to reside in memory at the same time, some processes that are not currently scheduled or are blocked are transferred to secondary memory. This dynamic moving of processes between memory and disk is called swapping.

Inverted Page Table

Linear inverted page tables: rather than mapping virtual addresses to physical addresses, we map physical addresses to virtual addresses. http://www.cs.berkeley.edu/~kamil/teaching/sp04/040104.pdf

File Systems

Give at least three strategies that a file system can employ to reduce the time a program spends waiting for data reads to complete.

  • Prefetching (spatial locality) - a process doing random access will not see the benefits.
  • Indexing (index the most-used paths) - access to files that are not commonly used will be affected.
  • Caching - if the data changes rapidly, the cache becomes stale quickly and the benefits diminish.

http://duartes.org/gustavo/blog/post/anatomy-of-a-program-in-memory/

kernel compilation notes

Kernel Versions

# Guest
goyal@fyra:~$ uname -a
Linux fyra 3.13.0-34-generic #60-Ubuntu SMP Wed Aug 13 15:45:27 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

# Host
ankit@ubuntu2:~$ uname -a
Linux ubuntu2 3.16.1 #1 SMP Thu Sep 4 18:57:37 CDT 2014 x86_64 GNU/Linux

Create VM

  • Install KVM and its dependencies.
  • Create an Ubuntu VM using VMBuilder:
sudo vmbuilder kvm ubuntu     \
    --arch i386               \
    --user ankit              \
    --name ankit              \
    --pass cracker            \
    --suite lucid             \
    --flavour virtual         \
    --addpkg openssh-server   \
    --libvirt qemu:///system


/udev notes

Problems with /dev

  1. A static /dev is unwieldy and big. It would be nice to show only the /dev entries for devices actually present in the system.
  2. We are running out of major and minor numbers for devices.
  [vagrant@vagrant-centos64 ~]$ ls -l /dev/sda
  brw-rw---- 1 root disk 8, 0 Aug 14 05:36 /dev/sda

  # shows permissions (brw-rw----), owner (root), group (disk), major device number (8),
  # minor device number (0), date (Aug 14), time (05:36) and the device name

When accessing a device file, the major device number selects which device driver is called to perform the read/write. The minor number is passed as a parameter, and its interpretation depends on the driver.

  3. Users want a way to name devices in a persistent fashion.
  4. Userspace programs want to know what devices are created or removed and what /dev entry is associated with them.


The Unix Time-Sharing System Notes

source: http://cm.bell-labs.com/cm/cs/who/dmr/cacm.html
published: ACM 1974

The File System

From the user's point of view, three kinds of files:

  1. ordinary disk files - no particular structure is expected by the system; the structure of files is controlled by the programs that use them (e.g., object files generated by the assembler). They could be binary files or simple text files.
  2. directories - mapping between the names of files and the files themselves.
    • System maintains several directories for its own use.
      • root directory: the starting point for other files in the system.
      • a directory that contains commands to be executed (not required; a convenience, since commands can be placed anywhere).
      • null filename refers to current directory.


Unique Abbreviation for a list of words

Find a unique abbreviation for each word in a given list of words.

#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
#include <queue>
#include <cstdlib>  // EXIT_SUCCESS

using namespace std;
/*
 * Method: abbreviateWord(string word, int offset)
 *
 * Example: internationalization, 1: i18n
 *          auto, 1: a2o
 *          automatic, 2: au5ic
 * */

string abbreviateWord(string word, int offset) {
  int length = word.length();
  int numeric_part;
  if (offset > length) {
    return word;  // too short to abbreviate further; return the word unchanged
  }
  string prefix = word.substr(0, offset);
  string suffix = word.substr(length - offset);  // the last `offset` characters
  numeric_part = length - (2 * offset);
  string result;
  if (numeric_part < 1)
    result = prefix + suffix;
  else
    result = prefix + to_string(numeric_part) + suffix;
  return result;
}

/** Returns unique mapping of abbreviations to words for a given list of words.
 * Input:
 * words: vector of words of string datatype.
 *
 * Return:
 * map with keys as abbreviations and values as the corresponding words.
 *
 * Example:
 * Input: ["acdef", "doordonotthereisnotry", "cool", "aqref", "internationalization"]
 * Output:
 *
 * acdef ->  ac1ef
 * doordonotthereisnotry ->  d19y
 * cool ->  c2l
 * aqref ->  aq1ef
 * internationalization ->  i18n
 *
 * */
unordered_map<string, string> uniqueAbbreviation(const vector<string>& words) {
  // maintain a hash map with abbv -> word mapping (returned by value).
  unordered_map<string, string> abbvWord;
  // maintain a queue for processing words.
  queue<string> wordQueue;
  int offset = 1; // initial offset
  // loop until all done
  for (auto &it : words) {
    wordQueue.push(it);
  }
  string word, abbv;
  int count = wordQueue.size();
  while(!wordQueue.empty()) {
    word = wordQueue.front();
    wordQueue.pop();

    abbv = abbreviateWord(word, offset);
    auto found = abbvWord.find(abbv);
    if (found == abbvWord.end()) {
      abbvWord[abbv] = word;
    } else {
      // collision: both words need a longer abbreviation, so requeue them
      wordQueue.push(found->second);
      abbvWord.erase(found);
      wordQueue.push(word);
    }
    count--;
    if (count == 0) {
      count = wordQueue.size();
      offset++;
    }
  }
  return abbvWord;
}

int main(void) {
  std::cout << "Internationaalization, 1: " << abbreviateWord("internationalization", 1) << endl;
  std::cout << "Internationalization, 2: " <<  abbreviateWord("internationalization", 2) << endl;
  std::cout << "abcdef, 3: " <<  abbreviateWord("abcdef", 3) << endl;
  vector<string> words = {"internationalization", "cool"};
  words.push_back("doordonotthereisnotry");
  words.push_back("acdef");
  words.push_back("aqref");

  unordered_map<string, string> abbvWords = uniqueAbbreviation(words);

  for(auto &it : abbvWords) {
    std::cout << it.second << " ->  " << it.first << std::endl;
  }

  return EXIT_SUCCESS;
}

Swap Using Reference

Write a method to swap two integers passed by reference.

#include <iostream>
#include <cstdlib>  // EXIT_SUCCESS
// swap using references
void swap(int &a, int &b) {
  int temp(a);
  a = b;
  b = temp;
}

int main(void) {
  int a = 1;
  int b = 2;
  std::cout << "a = " << a << ", b = " << b << std::endl;
  swap(a, b);
  std::cout << "a = " << a << ", b = " << b << std::endl;
  return EXIT_SUCCESS;
}

hola mundo

the hola mundo ("hello world") program.

class HolaMundo {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}
