Retrospects

Longest Text Producible in Editor

2013-01-10T15:34:00.000+01:00

I saw an unsolved problem at http://intearview.com:

Problem

You are provided with four possible operations that can be done on the editor(each operation requires 1 keyboard hit)

A
Ctrl+A
Ctrl+C
Ctrl+V

Now you can hit the keyboard N times and you need to find the maximum number of A's that can be printed. Also print the sequence of keyboard hits.

Algorithm

The first intuition on seeing this problem is we should use dynamic programming, as this problem has a clear structure of sub-problems. A straightforward trial is to distinguish different cases by the last operation. Nevertheless, difficulty arises when the last operation is Ctrl+V and the second last operation is not Ctrl+C.

In order to solve this problem, we have to further dissect the properties of an optimal edit sequence:

Ctrl+A always immediately precedes Ctrl+C, otherwise it cannot be optimal.
If a Ctrl+V follows an A, it is safe to push Ctrl+A backward to bypass A.
If a combo Ctrl+A Ctrl+C precedes an A, it is better to push the combo forward to bypass A. That is the original edit sequence cannot be optimal.

In summary, an optimal edit sequence can be shaped in a normal form, in which Ctrl+V's are grouped together, and a combo Ctrl+A Ctrl+C immediately precedes each group.

As a consequence, if we denote the longest sequence of n operations by C(n), the recursion can be expressed as:
$$C(n) = max(1+ C(n-1), max_{i}((i+1)\times C(n-2-i)))$$
in which $1+C(n-1)$ is for the case when the last operation is A, and the latter part is for the case when a continuous block of Ctrl+V is at the end of the optimal edit sequence. Since we don't know the optimal size of the block, we have to do a linear scan which amounts to an algorithm of $O(n^2)$

Code

static String[] OPERATIONS = {"A", "Ctrl+A", "Ctrl+C", "Ctrl+V"};

int longestText(int n) {
  assert n > 0;
  int[] length = new int[n + 1];
  length[0] = 0;
  length[1] = 1;

  int[] optBlockSizes = new int[n+1]; // optimal size of Ctrl+V block ending at each position
  optBlockSizes[1] = 0; // no Ctrl+V block, i.e., A

  for (int i = 2; i < n+1; i++) { // for each length of edit sequence
    int len = length[i-1] + 1;
    optBlockSizes[i] = 0; // i.e., A
    for (int j = 1; j < i-2; j++) { // for each size of Ctrl+V block
     int newLen =  (j+1)*length[i-2-j];
     if (newLen > len) {
       len = newLen;
       optBlockSizes[i] = j;
     }
    }
  }
   
  // print the edit sequence 
  printEditSequence(optBlockSizes);

  return length[n];
}

void printEditSequence(int[] pastBlocks) {
   // the array is [0,n], with the first element useless
   int n = pastBlocks.length - 1;
   String[] editSequence = new String[n+1];
   printEditSequence(editSequence, pastBlocks, n);
}

// last is the largest unprocessed element index
void printEditSequence(String[] sequence, int[] blocks, int last) {
   if (last == 0) {// print sequence[1:n];
     for (int i = 1; i < sequence.length-1; i++) {
       System.out.print(sequence[i]+", ");
     }
     System.out.println(sequence[sequence.length-1]);
   }
   else {
     if (blocks[last] == 0) {
       sequence[last] = "A";
       printEditSequence(sequence, blocks, last-1);
     }
     else {
       int blockSize = blocks[last];
       for (int i = last; i > last - blockSize; last--) {
         sequence[i] = "Ctrl+V";
       } 
       sequence[last-blockSize-1] = "Ctrl+A";
       sequence[last-blockSize] = "Ctrl+C";
       printEditSequence(sequence, blocks, last-blockSize-2);
     }
   }
}

Fast Exponentiation

2012-11-28T12:16:00.000+01:00

Computing the power of a number is quite common. A simple form appears frequently in coding interviews is to compute the power of integers.

Problem: given two integers n and k, compute the number $ n^k $.

One well-known algorithm is to perform exponentiation by repeated squaring. I saw the following pseudo code in Udi Manber's book:

Input: n and k 
Output: P
P := n;
j := k;
while j > 1 do
  P := P*P;
  if j mod 2 = 1 then
    P := P*n;
  j = j div 2;
end

The code is wrong, as can be seen by testing k = 5, n = 2. The above program produces 16, while 32 is the expected result. The simplest implementation of the repeated squaring is to use recursion, which is difficult to be wrong:

long power(int n, int k) {
  assert n > 0 && k > 0;
  if (k == 1) return n;

  if (k & 1 == 1) { // odd
    return n * power(n, k-1);
  }
  else {
    long half = power(n, k/2);
    return half*half;
  }
}

The pseudo code we presented at the beginning is iterative, and hence more efficient. However, as no correct loop invariant is maintained in it, it fails to compute the desired value. A simple intuition to compute the $n^k$ is to scan the exponent k bit-wise from low to high. Every time we check whether the current bit i is 1, if so, we multiply the corresponding $2^i$. There are all together $log_2(k)$ bits.

long power_iterative(int n, int k) {
  assert n > 0 && k > 0;
  int b = k; // high bits to be processed
  long result = 1;
  long base = n;
  while (b != 0) {
    // invariant: result*(base^b) == n^k
    if (b & 1 == 1) {
      result *= base;
    }
    base *= base;
    b >>= 1;
  }
  return result;
}

If the arguments are not guaranteed to be positive, we have to take extra care.

double power(int n, int k) {
  if (n == 0 && k <= 0) 
    throw new RuntimeException("For 0, only positive exponent allowed!");
  if (k == 0) return 1;
  if (n == 0) return 0; 

  int sign = 1; 
  if (n < 0) {
    sign = (k & 1) == 1 ? -1 : 1;
    n = 0 - n;
  }

  long result = sign;
  result *= power_iterative(n, Math.abs(k));
  
  return k > 0 ? result : 1.0/result;
}

One application of fast exponentiation is to compute Fibonacci number. We recall that the Fibonacci number can be expressed in a matrix form: $$ \begin{bmatrix} f(n+1) \\ f(n) \end{bmatrix}= \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}\times \begin{bmatrix} f(n) \\ f(n-1) \end{bmatrix} $$ If we denote by A the matrix: $ \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} $, then we know $$\begin{bmatrix} f(n+1) \\ f(n) \end{bmatrix} = A^{n}\times\begin{bmatrix} f(1) \\ f(0) \end{bmatrix}$$
The algorithm is in O(log(n)), while an iterative accumulation algorithm takes O(n).

class Matrix{
  long a, b , c, d;
  Matrix(long i , long j, long k, long l) {a=i; b=j; c=k; d=l;}
  Matrix(Matrix m) {a = m.a; b = m.b; c = m.c; d = m.d;}
}
Matrix multiply (Matrix n, Matrix m) {
   return new Matrix(n.a*m.a+n.b*m.c, n.a*m.b+n.b*m.d, n.c*m.a+n.d*m.c, n.c*m.b+n.d*m.d);
}
Matrix fastExponentiation(Matrix m, int k) {
  assert k > 0;
  Matrix base = new Matrix(m);
  Matrix result = new Matrix(1, 0, 0, 1);
  int b = k;
  while (b != 0) {
    if(b & 1 == 1) result = multiply(result, base);
    base = multiply(base, base);
    b >>= 1;
  }
}
long fibonacci(int n) {
  assert n >= 0;
  if (n == 0 || n== 1) return 1;
  Matrix m = new Matrix(1, 1, 1, 0);
  Matrix power = fastExponentiation(m, n);
  return power.c;
}

Triangle in Undirected Graph

2012-11-23T17:04:00.002+01:00

Problem

Given an undirected graph, design a O(V*E) algorithm to detect whether there is a triangle in the graph or not.

Algorithm

The idea of the algorithm comes from Manber's book. The central observation is that in order to form a triangle, say (i, j, k), then the corresponding rows of the first two vertices i and j should have both 1 on the k-th column. We can simply do a linear intersection to check this. However, in addition, we have to check that vertex i appears in the j-th row and vice versa. Using adjacency matrix, the complexity is then O(V^3), with V being the number of vertices. For sparse matrix, a O(V*E) algorithm is more efficient. Instead of using adjacency matrix, we now use adjacency lists. We can do the intersection using a procedure similar to merge-sort. More specifically, we perform the following:

sort all the adjacency lists
for each two pair of vertices

intersects the edges to find common end points
meanwhile check whether the pair is an edge

To see such an algorithm is O(V*E), we have to notice that each adjacency list has been intersected with all other V-1 adjacency lists. If we denote by $L_i$ the length of the adjacency list incident on node i, then the total intersection time is: $$\Sigma_i (V-1)L_i = (V-1)\Sigma_i L_i = (V-1)E$$ We also notice that the cost of sorting the adjacency lists are: $$\Sigma_i L_ilog(L_i) \leq \Sigma_i L_ilog(E) = ElogE$$ To sum up the complexity is O(ElogE+V*E) = O(V*E).

Code

class Graph{
  List<Integer > [] adjLists; // adjacency lists
  int n; // size of graph
}

boolean containsTriangle(Graph g) {
  // sort all adjacency lists in arrays
  Integer[][] arrays = new Integer[g.n][];
  for (int i = 0; i < g.n, i++) {
    arrays[i] = g.adjLists[i].toArray(new Integer[0]);
    Arrays.sort(arrays[i]);
  }
  
  // iterate all pairs
  for (int i = 0; i < g.n; i++) {
   for (int j = i + 1; j < g.n; j++) {
     // test intersection and connectivity
     int pi = 0;
     int pj = 0;
     boolean intersected = false;
     while (pi < arrays[i].length && pj < arrays[j].length) {
       if (arrays[i][pi] == arrays[j][pj]) {
         intersected = true;
         break;
       }
       else if (arrays[i][pi] < arrays[j][pj]) {
         pi++;
       }
       else { 
         pj++;
       }
     }

     // check connectivity by checking whether i is in list of j
     boolean connected = false;
     for (Integer k : arrays[j]) {
       if (k == i) { 
         connected = true;
         break;
       }
       else if (k > i) {
         break;
       }
     }

     if (intersected && connected) return true; // early return
   }
  }
  return false;
}

Minimal Truncation

2012-11-23T14:54:00.001+01:00

I saw an interview question on CareerCup.

Problem Description

Given an array A of positive integers. Convert it to a sorted array with minimum cost. The only valid operation are:
1) Decrement with cost = 1
2) Delete an element completely from the array with cost = value of element

Algorithm Development

An intuitive idea is to use dynamic programming. The analysis is as follows:

find the min and max of the array int[] a in O(n) time, with n being the size of the array
maintain a n*(max-min+1) table memorizing minimal cost (MC): MC(i, k) is the minimal cost of transforming the subarray a[i:] with a lower bound k
set up recursion

base case MC(n-1, k) = 0 if k<=a[n-1] or a[n-1] if k>a[n-1]
given the minimal costs of subarray a[i+1:], estimate MC for a[i:]

if k > a[i], a[i] has to be deleted, otherwise the lowerbound is violated, which leads to MC(i, k) = a[i] + MC(i+1, k) (denoted by r1). The minimal cost of a[i:] with lower bound k is then the cost of deleting a[i] plus the minimal cost of the subsequence a[i+1:] with the same lower bound.
if k <= a[i], it is safe to always reserve a[i], because any truncated subsequence from a[i+1:] can be improved by prepending a possibly truncated portion of a[i]. However, it is unknown how much a[i] should be truncated to achieve optimality. What we know is the optimal sorted array for a[i:] is a truncated a[i] prepended to a subsequence from a[i+1:]. The key is to enumerate all the possible lower bounds for the subarray a[i+1:], which are the range [k, a[i]]. The lower end of the range is from the lower bound of the current sequence, while the upper end of the range is the largest value that may make a difference for the cost. For lower bound larger than a[i], there is always no cost for truncating a[i]. Moreover, MC(i, K1) < MC(i,K2) if K1 < K2. We hence have:
r2) MC(i, k) = min_{k<=j<=a[i]}{MC(i+1, j) + (a[i] - j)}
The minimal cost is then taken as the minimum of all candidate new lower bounds for subarray a[i+1:].
The cost is then the truncation cost (a[i]-j) plus the cost from subarray a[i+1:].
To make the recursion simpler, we can rewrite r2) as:
r3) MC(i, a[i]) = MC(i+1, a[i])
r4) MC(i, j) = min{MC(i, j-1), a[i]-j + MC(i+1, j)} for j from a[i]-1 to min

Finally, the minimal entry in the first column MC(0,k) is the minimal cost. If we need to find the truncation, just search in MC column wise bottom up for the first switch of the costs. If MC(0, k)->MC(0,k+1) is the first switch, then k is the maximal lower bound in a[0:]. Truncate a[0] to k, if smaller than k, remove a[0]. Now search right up in MC and similar for remaining columns.

Code

// time O(n*max); space O(n*max), reducible to O(n) if only minimal cost is needed
int findMinCost(int[] array){
    int min, max, n;
    min = findMin(array);// feasible in O(n)
    max = findMax(array);// feasible in O(n)
    n = array.length;
    
    // memory table for estimates
    int[][] mc = new int[n][max+1];// wasted a little bit memory; we only use mc[i][min:max]
 
    // initialize last column
    for(int k = min; k<=array[n-1]; k++){
        mc[n-1][k]=0;      
    }
    for(int k = a[n-1] + 1; k <= max; k++){
        mc[n-1][k] = a[n-1];
    }



    // fill up table column by column backwards
    for(int i = n-2; i >= 0; i--){
        for(int k = max; k>array[i]; k--){
            mc[i][k] = array[i]+mc[i+1][k];
        }
        
        mc[i][array[i]] = mc[i+1][array[i]];
        
        for(j = array[i]-1; j>=min; j--){
            mc[i][j] = Math.min(mc[i][j-1], array[i]-j+mc[i+1][j]);
        }
    }
    
    return mc[0][min];
}



/**
* Time: O(n+max)
*/
Integer[] extractTruncatedArray(int[] array, int[][] mc, int min, int max){
    ArrayList list = new ArrayList();
    int lowerBound = min;//initial lower bound
    for(int i = 0; i  < array.length; i++) {// extract truncated elment at i
      // find first switch
      while(lowerBound < max && mc[i][lowerBound] == mc[i][lowerBound+1]){
        lowerBound++;
      }
      if(a[i] >= lowerBound){//truncate a[i]
            list.add(lowerBound);
      }//otherwise a[i] is deleted
    }
    return list.toArray(new Integer[0]);
}

Rang Minimum Query

2012-11-20T15:32:00.000+01:00

Range minimum query is the problem to answer algorithmically the minimum of an arbitrary range in the form of [lo, hi], given a large integer array a[0:n). Of course without pre-processing, each query is O(n), which is unacceptable if the queries are frequent. Another straightforward way is to memorize the answer to all ranges and perform a table lookup for the answer. Though this leads to an O(1) query answering operation, the pre-processing takes $O(n^2)$ time. More problematic, the space complexity is also $O(n^2)$. For an array to an order of $10^6$, the space consumption already exceeds the memory size of normal commodity machines nowadays. Our goal is derive an algorithm that has a fast enough query answering time and meanwhile is not too expensive regarding pre-processing.

We first identify the API of the problem:

void preprocess(int[] a);

int min(int lo, int hi);

We denote the complexity explicitly into two parts: pre-processing and query answering. For instance, the direct processing without pre-processing is of complexity $O(1)-O(n)$, whereas the memorizing-all approach is of complexity $O(n^2)-O(1)$.

1.Disjoint Segment Algorithm $O(n)-O(\sqrt{n})$

The first algorithm comes from an interview. The idea is to divide the whole array into segments and maintain a summary (in our case, the minimum) of the segment. When answering a query, we need to gather all the segments overlapping with the query range and return an answer. Because the covered segments are already aggregated during the pre-processing time, the computation is saved. One question comes into mind: how do we determine the length of a segment. We notice that a range may cover several segments completely, and overlap with at most two segments additionally. Therefore, we can optimize the estimated cost of a query, depending on whether we opt for worst case or average case. As for average case, we can simply choose each segment to be of length $\sqrt{n}$.

int[] segmentMin; 
int segmentLen;

void preprocess(int[] a) { 
  assert a != null && a.length != 0; 
  int len = a.length;
  segmentLen = Math.sqrt(len);

  segmentMin = new int[Math.ceil(len/segmentLen)];

  for (int i = 0; i*segmentLen < len; i++) {
       // invariant: already processed i segments
       int min = a[i*segmentLen];
       int upper = Math.min((i+1)*segmentLen, a.length);
       for (int j = i*segmentLen + 1; j < upper; j++) {
           min = Math.min(min, a[j]);
       }
       // assert min == range_min[i*segmentLen,
       //               (i+1)*segmentLen);
       segmentMin[i] = min;
  }
} 

int min(int lo, int hi) {
  assert hi > = lo;
  int rangeMin = a[lo];

  // assert segmentLen != 0;
  int startSegment = Math.ceil(lo/(float)segmentLen);
  int endSegment = Math.floor(hi/(float)segmentLen);

  // collect summaries from fully covered segments
  for (int i = startSegment; i <  endSegment; i++) {
      rangeMin = Math.min(rangeMin, segmentMin[i];
  }
   
  if (lo % segmentLen != 0) { // there is a partial overlap
    for (int i = lo; i < segmentLen*startSegment; i++) {
      rangeMin = Math.min(rangeMin, a[i]);
    }
  }  
  
  if (hi % segmentLen != 0) { // partial overlap
    for (int i = endSegment*segmentLen; i <= hi; i++) {
      rangeMin = Math.min(rangeMin, a[i]);
    }
  }

  return rangeMin;
}

In the preprocessing phase, we scan the whole array. Therefore, the complexity is O(n). Each query will collect at most $O(\sqrt{n})$ summaries, and scan two segments, each of which is of length $\sqrt{n}$. Therefore, the query is performed in $O(\sqrt{n})$ time.

2. Variable Length Summary Algorithm $O(nlog(n))-O(1)$

I saw this algorithm in Rujia Liu's book on OI. Compared to the disjoint segment algorithm, this algorithm stores more summaries on overlapping segments. The rise in the number of summaries gives better query answering performance. In one sentence, we store at each index i the range minimum in a segment of length of the form $2^j$. As there are n positions in the array, and the number of different length options are log(n), we are able to store all the summary in a table of the size nlog(n). When we need to process a query [lo, hi], we first determine the length of the range, say L. Let k be $\lfloor log(L) \rfloor$, the range minimum is then minimum of the two summaries at positions (lo, k), (hi + 1- L/2, k).
We now give the complete code in Java.

int[][] summary; // table of the form summary[beginIndex][lengthIndex]
int length; // array length
int level; // maximal exponent of a range length

void preprocess(int[] a)  {
  assert a != null && a.length > 0;
  length = a.length;
  int level = Math.ceil(Math.log(length)/Math.log(2));
  summary = new int[length][level+1];
  
  for (int i = 0; i < length; i++) {
    summary[i][0] = a[i];
  }  

  // build summary bottom-up
  for (int l = 1; l < level + 1; l++) {
    // invariant: already built l levels
    for (int i = 0; i < length; i++) {
      summary[i][l] = Math.min(getSummary(i, l-1), getSummary(i + 1<<(l-1), l-1));
    } 
  }
  
}
// handle overflow index
int getSummary(int pos, int level) {
  if (pos >= length) {
    pos = length - 1;
  } 
  return summary[pos][level];
}

int min(int lo, int hi) {
  assert hi >= lo;
  if (hi == lo) return summary[lo][0]; // a single element
  int rangeLen = hi+1-lo;
  assert rangeLen > 1;
  int halfRangeLevel = Math.ceil(Math.log(rangeLen)/Math.log(2)) - 1;
  assert halfRangeLevel >= 0;
  return Math.min(getSummary(lo, halfRangeLevel),
                  getSummary(hi + 1 - (1 << halfRangelevel), halfRangelevel)
                 );
}

Since each range query only incurs two table lookup operations, we know it is O(1). The crux of O(nlog(n)) preprocessing time lies in the fact we employ a bottom-up dynamic programming computation for the summaries.

3. Hierarchical Summary Algorithm $O(n)-O(log(n))$

This algorithm is inspired by an exercise in Udi Manber's book on algorithms. The basic idea is to organize hierarchical summaries in a tree, with the array elements as the leaves. Building the tree bottom-up requires O(n) time and space, while query requires O(log(n)) time to navigate the tree. Different from the disjoint summary, the summaries are overlapping. Furthermore, also different from the variable length summary, we now build summaries in a hierarchy in which a parent summary covers the range of two children. The summaries now form a tree. In order to make the bottom-up building convenient, we use linked lists to memorize nodes on the same level.

class Node {
  int min; // range min
  int lo, hi; // remembering range [lo, hi)
  Node left, right;
  Node(int l, int h, int m, Node left, Node right) {
    lo = l; hi = h; min = m; this.left = left, this.right = right;
  }
  Node (in l, int h, int m) {
    lo = l; hi = h; min = m; 
  }
}


Node root;

void preprocess(int [] a) {
  assert a != null && a.length > 0;
  int level = Math.ceil(Math.log(a.length)/Math.log(2));
  LinkedList [] nodesByLevel = new LinkedList[level + 1];
  nodesByLevel[0] = new LinkedList();
  for (int i = 0; i < a.length; i++) {
    nodesByLevel[0].add(new Node(i, i+1, a[i]));
  }

  for (int l = 1; l < level + 1; l++) {
    // invariant: already processed l levels
    // invariant: level l summarizes range of length 2^l (except for boundaries)
    LinkedList pre = nodesByLevel[l-1];
    nodesByLevel[l] = new LinkedList(); // current level
    Node first = null, second = null;
    Iterator it = pre.iterator();
    while(it.hasNext()) {
      first = it.next();
      if (it.hasNext()) {
        second = it.next();
      }
      int lo = first.lo;
      int hi = second == null ? first.hi : second.hi;
      int min = second == null ? first.min : Math.min(first.min, second.min);
      nodesByLevel[l].add(new Node(lo, hi, min, first, second));
    }
  }
  root = nodesByLevel[level];
}

int min(int lo, int hi) {
  return min(root, lo, hi+1);
}

// internally we use [lo, hi)
int min(Node node, int lo, int hi) [
  if (node == null) return Integer.MAX_VALUE;
  if (!isOverlapping(node, lo, hi)) {
    return Integer.MAX_VALUE;
  }
  else if ( isCovered(node, lo, hi) ) {
    return node.min;
  }

  // assert node range partially overlaps with query range
  // or query range covered by node range
  int leftMin = min(node.left, lo, hi);
  int rightMin = min(node.right, lo, hi);

  return Math.min(leftMin, rightMin);
}

// is the node range overlapping with [lo, hi)
boolean isOverlapping(Node node, int lo, int hi) {
  assert node != null;
  return node.hi > lo && node.lo < hi;
}
// Is the range of the node covered by [lo, hi)
boolean isCovered(Node node, int lo, int hi) {
  assert node != null;
  return lo <= node.lo && hi >= node.hi; 
}

First, the preprocessing is in O(n), because the tree has O(n) nodes, and to create each node, we only need constant time. To see the query is in O(log(n)) is more involved. If a node range is fully covered by a query or disjoint with a query, then it directly returns and incurs no additional cost. The problem arises when a query partially overlaps with both children of a node, which perform two recursions. The main observation is a query can partially overlaps with two children of a node only once, after which only one out of disjoint or fully covered can happen. More intuitively, the query processing navigates the summary tree to find the two ends without exploring internal nodes. As the tree is of height O(log(n)), the query is also O(log(n)).

LOB in Hibernate over DB2 9.5

2009-03-17T19:31:00.003+01:00

Using BLOB/CLOB in hibernate may cause exceptions complaining "LOB closed" when running on top of DB2 (see Hibernat Users FAQ). The reason can be the progressive streaming feature of DB2.

There are two solutions:

Use iterate() of ResultSet and load LOB content during each iteration.
Disable progressive streaming by specifying options in the JDBC url:
jdbc:db2://localhost:50000/DatabaseName:fullyMaterializeLobData=false;progressiveStreaming=2;progresssiveLocators=2;
Please note: fullMaterializeLobData is set to false, which enables lazy loading.

Two ways to make bad documentation

2008-11-13T11:07:00.004+01:00

Since our group are going to switch to a modern Content Management System (CMS), I am investigating candidate products. Two promising choices are Apache Lenya and Plone. I have to say that going deep into neither of the two is pleasing.

Lenya suffers from lacking documentation. Till now, not a single book is published about it. Furthermore, it is migrating to version 2.0, which renders most of existing documentation useless. At the time of writing, there is no complete manual on the Lenya site. However, because of the conceptual simplicity, Lenya is easier to understand in my opinion. The use of Apache Cocoon also adds to the value of Lenya.

It is just the other side for Plone. We are just flooded by documentations, which are full of specific terminology known only inside the community. Again, almost all the existing documentations are out of date for Plone 3. Furthermore, i have a feeling that the document authors are not really serious programmers. They tend to use quite informal and inaccurate descriptions. Maybe that is the way the supporting staff make money. Plone gives me a feeling of SAP, which always requires 2nd development that is more interesting to service staff than to users.

Lessons learned:
- documentation should be up to date and understandable:
There should be at least one entry level sample that can be run. If some special terms really have to be used, define them first. Finally, combining documentations for both front-end users and developers would just be confusing.
- Simplicity is much better than complexity
There should be some overview over the architecture or main mechanism. Although a framework can be rather complicated, there should be some central ideas that are so simple that humans can bear in mind.

My Visit to Philips Research

2008-02-12T12:43:00.001+01:00

My most valuable experience from the visit is not about the technical problems being solved. In fact two of three labs i visited were dealing with some medical image processing, operated by physical scientists. Although radioactivity could still be understood, a pair of photons was too distant from me. The remaining project was trying to aim stroke rehabilitation using sensors and computers. During the lunch served around 1pm, we got opportunity talking with those senior researchers with background in physics. It was fun talking with them. A person who has received enough scientific training behaves somehow differently, when they consider problems. It is independent on any particular knowledge, but the insight on solutions, quick understanding of questions, and pragmatic way of problem solving.

Hooray For Mylyn : Task-focused Development in Eclipse 3.3 (Europa)

2007-07-02T14:44:00.002+02:00

Last Friday, Eclipse 3.3 (code name Europa) was released. As an enthusiast for Eclipse Framework, i cannot wait take a first taste. From my point of view, the most attractive feature, besides other numerous improvements, is the integration of Mylyn, a task focus tool. Some opinions are to be given in the rest of this review, which are NOT intended as an tutorial of this tool. Anyone searching for a tutorial can see [1, 2]. A few high level characteristics of Mylyn are going to be discussed, together with some thinking why it is beneficial.

Mylyn in 3 Aspects
Mylyn is currently described as "task focused UI" officially [4]. However, it is much more than UI. Benefits provided by the tool can be categorized into three aspects:
Task-focus UI Overlay
A task is something like a bug or feature requirement, which is usually the unit of work programmers face with everyday. The motivation of task-focus is to easy programmers out of so-called information overloading, which is the case when lots of time is spent solely on navigating and searching for relevant elements(e.g., packages, classes, aspects, or methods) due to numerous projects in workspace or large enterprise projects. The idea of task focus, inspired by Aspect Oriented Methodology [5], tries to separate out elements (concerns of tasks) , which are usually scattered throughout project(s), into units of tasks. Furthermore, the seamless integration with Eclipse GUI makes Mylyn an undoubted success. The use of Mylyn can be pervasive in Europa. You can find it in, just to name a few, package browser, java editor, synchronizing view, context menu, and etc. All the visual elements have an overlay of task-focus, which mainly does filtering and folding, leaving only elements relevant with current activated task.
Adaptive Relevance Discovery
The UI overlay works all fine, in presence of task relevance data that maintains all elements relevant with each task. Then, where does such information come from? Mylyn employs a quite smart way: all user interactions are intercepted and recorded. Thus when a task is activated, the knowledge of relevance can be learned gradually. Intuitively, the relevance can be measured simply by counting the references(selecting and editing), while decaying old statistics over time. In fact, Mylyn adopted a tree-structured degree-of-interest model [5]. Since the knowledge of relevance learned by recording programmer activity, is valuable as domain expertise. For example, if someone has to work with legacy issues originally handled by others,
it would be just too nice to have the relevance data available. That is what Mylyn does: relevance data (called context of task) can be stored and shared via issue tracking systems (see the next section). We call this type of explicit management of domain expertise expertise as asset. It is especially valuable for teamworks (e.g., community driven projects), in which knowledge transfer is expensive.
Tracking Repository Integration
It is always good to stand on the shoulder of giants. Bug/Issue tracking systems are necessary components of most of the open-source projects today. Mylyn is finely integrated with systems such as Bugzilla, Trac, and JIRA. It provides nice GUI based ticket tracking repository access from IDE. Operations such as query, task submission, automatic notification and offline editing are supported. According to our experience, integration with IDE not only speed up productivity, but also encourages use of tracking system in internal projects ( no requirement for public ticket submission). As we stated earlier, Mylyn stores the context (relevance data) in Bugzilla, and allows patches for contexts. Therefore, expertise is not only shared but also co-worked on.

Lessons Learned
Separate Concerns: it is our old friend. From procedures to classes, from scattered crosscutting codes to aspects, we are always employing the divide-and-conquer again and again. Although algorithmically divide-and-conquer does not always lead to better solution, it is almost always good to decompose problems to reasonable sub-tasks in software engineering.
Expertise as Asset: reuse is really a main issue in improving productivity and reproductivity of a business. For code reuse we have components/libraries; for design reuse we have Model Driven Architecture (model as asset) ; now we have expertise reuse, not for automation but for knowledge transfer or documentation. Code communicates, but easier form of communication is even better.
Integration Is Good: interoperability between softwares or data repositories is long asking for a solution. The result of successful interoperability usually turns out to be integration. Data Warehouse, Enterprise Application Integration, Service Oriented Architecture can all be deemed as replies to answer such requirements.

Conclusion
I get excited over the appearance of task-focused UI and believe it will eventually become standard for IDE's to acquire such functionalities.

References

[1] Mik Kersten Task-focused programming with Mylar. Available at http://www-128.ibm.com/developerworks/java/library/j-mylar1/

[2] Mylyn User Guide. Available at http://wiki.eclipse.org/index.php/Mylyn_User_Guide

[3] Mik Kersten, Rob Elves and Gail Murphy. WYSIWYN: Using Task Focus to Ease Collaboration. CSCW Workshop, 2006.

[4] Mylyn Project. http://www.eclipse.org/mylyn/index.php

[5] Mik Kersten and Gail Murphy. Mylar: a degree-of-interest model for IDEs. Aspect Oriented Software Development, 2005.

A Typical Anti-pattern of using SAX

2007-01-22T14:00:00.000+01:00

Recently I just encountered a small problem with parsing XML documents with SAX. As is well known, the handler interface in SAX has three major call-backs for elements, namely startElement, endElement, and characters . The first two methods are called starting/ending a tag, while the third method is aimed at handling character data between tags. The problem lies in the contract of the method characters(char[] ch, int start, int length) . In the specification of SAX 2.0, it is stated that implementations of parsers are not required to ensure encapsulating one string between two corresponding tags in one call of this method.

Thus in order to get the characters between tags, be careful to buffer all the chunks through possibly several calls of characters() . What's more, do build guard boolean variables to collect the characters belonging to one element. Otherwise "\n" between tags might be also collected.

Retrospects

Longest Text Producible in Editor

Problem

Algorithm

Code

Fast Exponentiation

Triangle in Undirected Graph

Problem

Algorithm

Code

Minimal Truncation

Problem Description

Algorithm Development

Code

Rang Minimum Query

1.Disjoint Segment Algorithm \(O(n)-O(\sqrt{n})\)

2. Variable Length Summary Algorithm \(O(nlog(n))-O(1)\)

3. Hierarchical Summary Algorithm \(O(n)-O(log(n))\)

LOB in Hibernate over DB2 9.5

Two ways to make bad documentation

My Visit to Philips Research

Hooray For Mylyn : Task-focused Development in Eclipse 3.3 (Europa)

A Typical Anti-pattern of using SAX