Convert Chinese LaTeX Source to HTML and PDF

I am going to write a book in Chinese. I hope that I can publish its chapters on my blog, so I can get feedback before it is printed. This requires that I can convert my manuscript into HTML format (for publishing in my blog) and PDF format (for printing).

I tried to write in Emacs Org mode, Wiki and Markdown. However, none of them supports equations well. So I decided to use LaTeX.

I tried several tools for converting LaTeX source into HTML, including htlatex and pandoc. htlatex does not support Chinese well, and pandoc supports only a small subset of LaTeX syntax. Finally, I settled on Hevea, which works well for me.

I use XeLaTeX to convert LaTeX to PDF. Compared with PDFLaTeX, XeLaTeX works better with UTF-8 and TrueType Chinese fonts.

However, Hevea and XeLaTeX have different requirements for the preamble of the LaTeX source, so I created a separate template for each. Each template uses LaTeX’s \input directive to include a LaTeX source file containing the real text.

An example project is at https://github.com/wangkuiyi/hevea-xelatex.

Configure an HDFS for Development/Testing

I am using a Go implementation of the WebHDFS interface: https://github.com/vladimirvivien/gowfs. In order to test it, I need to set up HDFS on my development computer (Mac OS X 10.8, Hadoop-2.2.0). The author, Vladimir Vivien, points out two properties required to enable WebHDFS:

  1. Enable dfs.webhdfs.enabled property in hdfs-site.xml
  2. Ensure hadoop.http.staticuser.user property is set in your core-site.xml.

However, those are not enough. The append operation may report an error like the following:

Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try.

On a single-datanode development setup there is no other datanode to swap in, so if you see this error, add the following properties to hdfs-site.xml:

  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>

  <property>
    <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
    <value>false</value>
  </property>
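
To check that WebHDFS answers after these changes, one option is to request a directory listing over its REST interface. The sketch below only builds the URL; the host and port (localhost:50070, the usual NameNode HTTP address in Hadoop 2.x), the path /tmp and the user name yiwang are assumptions about the local setup, and webHDFSURL is a helper invented for this example. Opening the printed URL with curl or a browser should return a JSON listing if WebHDFS is enabled.

```go
package main

import (
	"fmt"
	"net/url"
)

// webHDFSURL builds a WebHDFS REST URL for operation op on path,
// acting as the given (assumed) HDFS user. Hypothetical helper.
func webHDFSURL(hostPort, path, op, user string) string {
	q := url.Values{}
	q.Set("op", op)
	q.Set("user.name", user)
	u := url.URL{
		Scheme:   "http",
		Host:     hostPort,
		Path:     "/webhdfs/v1" + path,
		RawQuery: q.Encode(), // keys are emitted in sorted order
	}
	return u.String()
}

func main() {
	// Assumed values: adjust host, port, path and user to your setup.
	fmt.Println(webHDFSURL("localhost:50070", "/tmp", "LISTSTATUS", "yiwang"))
}
```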

Editing CSV Files Using Emacs

CSV (comma-separated values) files are everywhere, though sometimes fields are separated by tabs or spaces instead of commas. I see many people edit CSV files using spreadsheet software like Microsoft Excel. I edit CSV files using Emacs, so I do not have to leave my programming environment.

Emacs can recognize CSV files once you have installed csv-mode: http://www.emacswiki.org/emacs-fr/download/csv-mode.el. Just download it, save it anywhere, and add the following line to your ~/.emacs file:

(load-file "/path/to/csv-mode.el")

Before you can edit your CSV file, make sure its values are separated by commas. If they are not, the following simple command line can help:

cat your_file | sed 's/\t/, /g' > your_file.csv

This command converts every tab in your_file into a comma and a space. (If your sed does not understand \t, as may be the case with some BSD versions, type a literal tab instead.) The resulting file, your_file.csv, can now be recognized by csv-mode.

After opening your_file.csv in Emacs, you might want to use M-x toggle-truncate-lines to disable the wrapping of long lines. Then, you can use M-x csv-align-fields to align fields. This makes the file look much as it does in Microsoft Excel.
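
If some fields already contain commas, a plain textual substitution produces an ambiguous file. As a hedged alternative, here is a minimal Go sketch that does the same conversion with the standard encoding/csv package, which quotes such fields. The function names tsvToCSV and convert are made up for this example, and it assumes one record per line.

```go
package main

import (
	"bufio"
	"encoding/csv"
	"fmt"
	"io"
	"strings"
)

// tsvToCSV reads tab-separated lines from r and writes CSV records to w.
// Fields containing commas or quotes are quoted by csv.Writer.
func tsvToCSV(r io.Reader, w io.Writer) error {
	out := csv.NewWriter(w)
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		if err := out.Write(strings.Split(sc.Text(), "\t")); err != nil {
			return err
		}
	}
	if err := sc.Err(); err != nil {
		return err
	}
	out.Flush()
	return out.Error()
}

// convert is a string-to-string convenience wrapper around tsvToCSV.
func convert(tsv string) string {
	var sb strings.Builder
	if err := tsvToCSV(strings.NewReader(tsv), &sb); err != nil {
		panic(err)
	}
	return sb.String()
}

func main() {
	fmt.Print(convert("name\tcity\nAlice\tParis, France\n"))
}
```

To convert a real file, pass an opened file (or os.Stdin) as the reader and os.Stdout as the writer.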

How to Sample a Dirichlet-Multinomial Distribution

Consider the problem of sampling from a multinomial distribution Mult(\vec{x}|\vec{p}, n), where \vec{p} is sampled from a Dirichlet prior distribution Dir(\vec{p}|\vec{\alpha}).

A conceptually straightforward solution is to sample \vec{p} from Dir(\vec{p}|\vec\alpha), and then to generate n samples from the discrete distribution defined by \vec{p}. As described by Wikipedia, sampling \vec{p}=\{p_1,\ldots,p_K\} can be done by drawing samples \{y_1,\ldots,y_K\} from K Gamma distributions, y_k \sim \Gamma(\alpha_k, 1) \text{,  } k\in[1,K], and then normalizing them: p_k = y_k/(\sum_j y_j). Also according to Wikipedia, if \alpha_k is a positive integer, we have \sum_{i=1}^{\alpha_k} - \log U_i \sim \Gamma(\alpha_k, 1), where each U_i is a sample drawn from the uniform distribution over (0, 1]. However, if the \alpha_k's are not positive integers, sampling from the Gamma distribution becomes a complex procedure.

Even if we implement the algorithm that draws samples from Gamma distributions and then normalizes them into a Dirichlet sample, it would not be numerically robust. If a uniform generator returns U_i = 0, \log U_i is -Inf. Another dangerous point is that if all K draws come out as y_k=0, then p_k=y_k/(\sum_j y_j) either triggers a divide-by-zero error or makes p_k NaN.

Fortunately, we can make use of the conjugacy between the Dirichlet and multinomial distributions. This conjugacy, as explained in the textbook Pattern Recognition and Machine Learning, lets us interpret \alpha_k as the prior number of observations of the multinomial outcome k. This leads to the following simple sampling method, which can be generalized further to sample from Dirichlet processes:
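
For comparison, here is a minimal sketch of the Gamma-based approach, restricted to positive integer \alpha_k. It is an illustration under assumptions, not a recommended implementation: the names sampleGammaInt, sampleDirichletInt and demoDirichlet are invented here, U_i is taken as 1 - rng.Float64() so that it lies in (0, 1], and the all-zero case is handled by simply redrawing.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// sampleGammaInt draws from Gamma(shape, 1) for a positive integer shape
// by summing -log(U_i) over shape uniform draws.
func sampleGammaInt(shape int, rng *rand.Rand) float64 {
	if shape < 1 {
		panic("shape must be a positive integer")
	}
	sum := 0.0
	for i := 0; i < shape; i++ {
		// rng.Float64() is in [0, 1), so 1 minus it is in (0, 1] and the log is finite.
		sum -= math.Log(1 - rng.Float64())
	}
	return sum
}

// sampleDirichletInt draws p ~ Dir(alpha) for positive integer alpha
// by normalizing independent Gamma draws.
func sampleDirichletInt(alpha []int, rng *rand.Rand) []float64 {
	p := make([]float64, len(alpha))
	total := 0.0
	for k, a := range alpha {
		p[k] = sampleGammaInt(a, rng)
		total += p[k]
	}
	if total == 0 {
		// Every y_k came out as 0 (extremely unlikely): redraw rather than divide by zero.
		return sampleDirichletInt(alpha, rng)
	}
	for k := range p {
		p[k] /= total
	}
	return p
}

// demoDirichlet draws one sample for alpha = (1, 2, 3) with a fixed seed.
func demoDirichlet(seed int64) []float64 {
	return sampleDirichletInt([]int{1, 2, 3}, rand.New(rand.NewSource(seed)))
}

func main() {
	fmt.Println(demoDirichlet(1))
}
```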

  1. \vec{p} = \vec\alpha, \vec{x} = \vec{0}, i = 0
  2. k \sim Discrete(\vec{p})
  3. p_k = p_k+1, x_k=x_k+1, i=i+1
  4. while i < n, goto 2.

Full Go code is as follows:

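import (
	"fmt"
	"math/rand"
)

// sampleDirichletMultinomial returns the histogram of n draws, using alpha as prior counts.
func sampleDirichletMultinomial(alpha []float64, n int, rng *rand.Rand) []int {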
	dist := make([]float64, len(alpha))
	copy(dist, alpha)
	hist := make([]int, len(alpha))
	for i := 0; i < n; i++ {
		k := sampleDiscrete(dist, rng)
		dist[k] += 1.0
		hist[k]++
	}
	return hist
}

func sampleDiscrete(dist []float64, rng *rand.Rand) int {
	if len(dist) <= 0 {
		panic("sample from empty distribution")
	}
	sum := 0.0
	for _, v := range dist {
		if v < 0 {
			panic(fmt.Sprintf("bad dist: %v", dist))
		}
		sum += v
	}
	u := rng.Float64() * sum
	sum = 0
	for i, v := range dist {
		sum += v
		if u < sum {
			return i
		}
	}
	panic("sampleDiscrete gets out of all possibilities")
}

Install GDB from Source on Mac OS X

This tutorial explains how to build GDB from source code:

  https://github.com/sirnewton01/godbg

However, a patch must be applied before running ./configure and make as described at the above link:

  cd gdb-7.7
  patch bfd/mach-o.c < ~/Download/patch   # the file to be patched is bfd/mach-o.c
  ./configure --prefix=/Users/yiwang/usr --disable-dynamic --enable-static --enable-expact --enable-python
  make -j8
  make install

 

Extract Text from PDF Files

I got this solution from Stackoverflow.

A more comfortable way to do text extraction is to use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. Here is a command you could try:

 pdftotext \
   -f 13 \
   -l 17 \
   -layout \
   -opw supersecret \
   -upw secret \
   -eol unix \
   -nopgbrk \
   /path/to/your/pdf \
   - | less

This displays pages 13 (first page) through 17 (last page) of a PDF file protected by both a user password (secret) and an owner password (supersecret), preserving the layout, using the Unix EOL convention, without inserting page breaks between PDF pages, and piping the result through less.

Removing control-Ms (^M) in Text File using Sed

When I extract text from a database or a PDF file (using xpdf’s pdftotext), I get fields or words suffixed with the special character ^M. Note that these ^M’s appear not only at the end of lines. I use the following command line to remove these annoying ^M’s:

sed 's/^M//g' my-text-file

Here the ^M in the shell command line is typed by pressing Ctrl-V and then Ctrl-M, and the trailing g removes every ^M on a line rather than only the first.
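
The same cleanup is easy to do inside a program, where ^M is the carriage return character "\r". A minimal Go sketch (stripCR is a name made up for this example):

```go
package main

import (
	"fmt"
	"strings"
)

// stripCR removes every carriage return (displayed as ^M), wherever it appears.
func stripCR(s string) string {
	return strings.ReplaceAll(s, "\r", "")
}

func main() {
	fmt.Printf("%q\n", stripCR("foo\r bar\r\nbaz\r")) // prints "foo bar\nbaz"
}
```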

Install YARN on Mac OS X

I ran into a whole bunch of problems when I tried to install Hadoop 2.2.0 (YARN) on my Mac OS X system. Thanks to Alex JF for the tutorial, which works with both Linux and Mac OS X.

Programming Qt 5 Using Go

Salviati’s go-qt5 project on Github makes it possible to write GUI programs using Qt 5 and Go.

On Mac OS X Mavericks, I tried to build and run some Qt 5.2 programs written in Go. Here follows what I did:

  1. Install Qt 5.2 using Homebrew.
    brew update && brew doctor && brew install qt5
    This would warn you that Qt 4 is more widely used than Qt 5. Anyway, go-qt5 is a Go binding of Qt 5.
  2. Check out go-qt5.
    mkdir -p /home/you/go-qt5
    export GOPATH=/home/you/go-qt5
    go get github.com/salviati/go-qt5
  3. Build go-qt5
    You can follow the README file on https://github.com/salviati/go-qt5 to build go-qt5 and some example programs. Since Homebrew does not create symbolic links for Qt 5, you need to invoke qmake as /usr/local/Cellar/qt5/5.2.0/bin/qmake. Also, you need to put $GOPATH/src/github.com/salviati/go-qt5/lib/libgoqt5drv.1.0.0.dylib together with your Go binaries in the same directory before you can execute the Go binaries.

RPC in Go: The Client Side

Conceptually, each client maintains a connection to the server, over which encoded requests and responses are transmitted, and a pending list, which holds requests that have been sent and are waiting for responses.

In order to match responses with pending requests, every call is assigned a monotonically increasing sequential number. The pending list is in fact a mapping from the sequential number to a rpc.Call struct.

Every client has a goroutine that collects responses and matches them with pending requests. When a response is matched, the goroutine notifies the caller of the completion and removes the matched request from the pending list.

Modification of the pending list is protected by the mutex rpc.Client.mutex, which avoids race conditions caused by sending requests and receiving responses in parallel.

The increment of the sequential number is protected by another mutex, rpc.Client.sending, so concurrent sends cannot cause any confusion about the sequential number.
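
The bookkeeping described above can be sketched in a few lines. This is a simplified illustration, not the net/rpc source: the types call and pendingList and their methods are invented for this example, and a single mutex guards both the counter and the map.

```go
package main

import (
	"fmt"
	"sync"
)

// call stands in for rpc.Call; only the fields needed here are kept.
type call struct {
	method string
	done   chan *call // must be buffered so that complete never blocks
}

// pendingList maps sequential numbers to calls awaiting a response.
type pendingList struct {
	mu      sync.Mutex
	seq     uint64
	pending map[uint64]*call
}

func newPendingList() *pendingList {
	return &pendingList{pending: make(map[uint64]*call)}
}

// add registers c under the next sequential number and returns that number.
func (p *pendingList) add(c *call) uint64 {
	p.mu.Lock()
	defer p.mu.Unlock()
	seq := p.seq
	p.seq++
	p.pending[seq] = c
	return seq
}

// complete removes the call registered under seq and notifies its caller.
// It reports false if no such call is pending.
func (p *pendingList) complete(seq uint64) bool {
	p.mu.Lock()
	c, ok := p.pending[seq]
	delete(p.pending, seq)
	p.mu.Unlock()
	if !ok {
		return false
	}
	c.done <- c
	return true
}

func main() {
	p := newPendingList()
	c := &call{method: "AService.AMethod", done: make(chan *call, 1)}
	seq := p.add(c)
	p.complete(seq) // what the receiving goroutine does when a response arrives
	fmt.Println((<-c.done).method, "completed with sequential number", seq)
}
```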

Establish the Connection

We can establish a TCP connection between the client and the server by calling rpc.Dial, which invokes net.Dial to establish a connection and calls rpc.NewClient to wrap up this connection as an RPC client, an rpc.Client struct.

We can also establish a TCP connection by using the CONNECT method of the HTTP protocol. This is done by calling rpc.DialHTTP, which invokes rpc.DialHTTPPath with DefaultRPCPath="/_goRPC_". rpc.DialHTTPPath then invokes net.Dial to establish a connection and sends an HTTP request whose method is CONNECT and whose URI is DefaultRPCPath. The RPC server, which must be an HTTP server, should understand that this request asks for a TCP connection for later RPC calls, so it should keep that connection alive. Once the connection is established, a call to rpc.NewClient wraps it up as a client.
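
Both ways of connecting can be tried against an in-process server. The sketch below is only an illustration: the Arith service, the Args type and the helper functions are hypothetical, and the servers listen on a loopback port chosen by the operating system.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"net/rpc"
)

// Args and Arith form a hypothetical service used only for this demo.
type Args struct{ A, B int }

type Arith struct{}

// Multiply is the RPC method; it stores A*B in reply.
func (Arith) Multiply(args Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

// newServer returns an RPC server with Arith registered.
func newServer() *rpc.Server {
	srv := rpc.NewServer()
	if err := srv.Register(new(Arith)); err != nil {
		panic(err)
	}
	return srv
}

// call invokes Arith.Multiply on an established client and closes it.
func call(client *rpc.Client, a, b int) (int, error) {
	defer client.Close()
	var product int
	err := client.Call("Arith.Multiply", Args{a, b}, &product)
	return product, err
}

// multiplyOverTCP serves on a raw TCP listener and connects with rpc.Dial.
func multiplyOverTCP(a, b int) (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	go newServer().Accept(l) // returns once l is closed
	client, err := rpc.Dial("tcp", l.Addr().String())
	if err != nil {
		return 0, err
	}
	return call(client, a, b)
}

// multiplyOverHTTP serves RPC behind an HTTP handler and connects with
// rpc.DialHTTP, which issues the CONNECT request to rpc.DefaultRPCPath.
func multiplyOverHTTP(a, b int) (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	mux := http.NewServeMux()
	mux.Handle(rpc.DefaultRPCPath, newServer()) // *rpc.Server is an http.Handler
	go http.Serve(l, mux)
	client, err := rpc.DialHTTP("tcp", l.Addr().String())
	if err != nil {
		return 0, err
	}
	return call(client, a, b)
}

func main() {
	fmt.Println(multiplyOverTCP(6, 7))
	fmt.Println(multiplyOverHTTP(6, 7))
}
```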

Encoding and Decoding

rpc.Client struct contains a member rpc.Client.codec with type rpc.ClientCodec, which wraps up the network connection. It encodes all requests and decodes all responses.

A client created by rpc.NewClient has an rpc.gobClientCodec codec, an implementation of the rpc.ClientCodec interface. It is also possible to specify another implementation by creating the client using rpc.NewClientWithCodec. The rpc.Dial* functions call rpc.NewClient, whereas jsonrpc.NewClient calls rpc.NewClientWithCodec.

Indeed, rpc.NewClient invokes rpc.NewClientWithCodec, and the latter, before returning the client object, starts a goroutine with go client.input(), which receives responses to pending requests and notifies callers of the completion of their calls.

  • There seems to be no CreateServerWithCodec on the server side. So how should I write a JSON-RPC server?
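
One possible answer to the question above: on the server side the codec is chosen per connection, by handing each connection to rpc.Server.ServeCodec together with a codec such as the one returned by jsonrpc.NewServerCodec. The sketch below wires a JSON-RPC client and server together over an in-memory net.Pipe; the Arith service, the Args type and multiplyOverJSON are hypothetical names for this example.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Args and Arith form a hypothetical service used only for this demo.
type Args struct{ A, B int }

type Arith struct{}

// Multiply is the RPC method; it stores A*B in reply.
func (Arith) Multiply(args Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

// multiplyOverJSON makes one JSON-RPC call over an in-memory connection.
func multiplyOverJSON(a, b int) (int, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(Arith)); err != nil {
		return 0, err
	}
	clientConn, serverConn := net.Pipe()

	// Server side: serve this connection with a JSON codec instead of gob.
	// With a real listener, do the same for every accepted connection.
	go srv.ServeCodec(jsonrpc.NewServerCodec(serverConn))

	// Client side: jsonrpc.NewClient uses the matching JSON client codec.
	client := jsonrpc.NewClient(clientConn)
	defer client.Close()

	var product int
	err := client.Call("Arith.Multiply", Args{a, b}, &product)
	return product, err
}

func main() {
	fmt.Println(multiplyOverJSON(6, 7))
}
```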

Make Calls

Conceptually, every RPC call consists of the name of the service and method, an argument, and the reply. In addition, an error might occur during the call, and a done channel is used to signal the completion of the call. All these are described by the rpc.Call struct.

To make a call, we can call rpc.Client.Go, which requires the service/method name, the argument, the holder of the reply, and the done channel. rpc.Client.Go encapsulates all these inputs into an rpc.Call struct and passes it to rpc.Client.send.

The rpc.Call struct has a method done, which, when invoked, signals the completion of the call by sending the rpc.Call struct itself to the done channel, which is of type chan *Call. As the done channel is provided by the caller, the caller learns the reply or the error once it reads the rpc.Call struct from the channel.
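
As a small runnable illustration of rpc.Client.Go and the done channel, the sketch below issues two asynchronous calls that share one buffered channel and then reads both completions. The Arith service, the Args type and sumOfProducts are hypothetical; the client and server talk over an in-memory net.Pipe. Note that rpc.Client.Go requires the done channel to be buffered.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
)

// Args and Arith form a hypothetical service used only for this demo.
type Args struct{ A, B int }

type Arith struct{}

// Multiply is the RPC method; it stores A*B in reply.
func (Arith) Multiply(args Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

// sumOfProducts computes 2*3 and 4*5 with two asynchronous calls
// and adds up the replies as they complete.
func sumOfProducts() (int, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(Arith)); err != nil {
		return 0, err
	}
	clientConn, serverConn := net.Pipe()
	go srv.ServeConn(serverConn)
	client := rpc.NewClient(clientConn)
	defer client.Close()

	// One buffered done channel shared by both calls; each call gets
	// its own reply holder.
	done := make(chan *rpc.Call, 2)
	client.Go("Arith.Multiply", Args{2, 3}, new(int), done)
	client.Go("Arith.Multiply", Args{4, 5}, new(int), done)

	sum := 0
	for i := 0; i < 2; i++ {
		call := <-done // completions may arrive in either order
		if call.Error != nil {
			return 0, call.Error
		}
		sum += *call.Reply.(*int)
	}
	return sum, nil
}

func main() {
	fmt.Println(sumOfProducts())
}
```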

Calling Patterns

The caller can provide one done channel to multiple RPC calls and wait to read all the responses from it. This can be very useful in some cases.

Consider that we have a bunch of downstream services, and it is OK if we get response from any of them. We can make calls like:

var clients []*rpc.Client         // connections to the N=10 services.
done := make(chan *rpc.Call, 10)  // buffer size >= 10 so that no completion notification is dropped.
for _, c := range clients {       // make calls to all these instances.
  // "AService.AMethod" and AReply are placeholders; each call needs its own reply holder.
  c.Go("AService.AMethod", arg, new(AReply), done)
}
call := <-done  // blocks until any instance replies.

Another common case is to collect information from a bunch of servers in order to build a request for the next stage of RPC invocation. This can be done by replacing the last line of the above code with:

var req ARequest
for range clients {  // read exactly one completion per call made above.
  call := <-done
  if call.Error != nil {
    log.Fatal(call.Error)
  }
  req.arg[call.ServiceMethod] = call.Reply
}
// Make another RPC call with req as the argument.

Sending Request

The method rpc.Client.send is a critical section protected by the mutex rpc.Client.sending. The method checks whether the client is closing or has been shut down. If so, it calls call.done to finish the call instead of transmitting it over the network connection.

Otherwise, it assigns a new sequential number to the call. This sequential number is used as the key when it adds the call to the pending list. Then the sequential number, together with the service/method name and the argument, is written to the server by calling client.codec.WriteRequest.

Receiving Response

The goroutine created by rpc.NewClientWithCodec (which is invoked by rpc.NewClient) runs rpc.Client.input, which is in charge of collecting responses and matching them with pending requests.

The response consists of a header and a body. If an error occurs while decoding the header, rpc.Client.input terminates itself. Otherwise, the matched call is removed from the pending list.

If the response has no matching pending request, there might have been something wrong with the call to rpc.ClientCodec.WriteRequest, and the body should be read and discarded. Or, if the response header reports an error on the server, the body should also be read and discarded. Only when everything is all right is the body read and decoded into call.Reply.

When a call was matched, rpc.Client.input then invokes call.done, with either an error from the server or with the reply.

Client Codec

The above procedure also explains why the rpc.ClientCodec interface contains the following methods:

  • WriteRequest(r *Request, body interface{}) error
  • ReadResponseHeader(r *Response) error
  • ReadResponseBody(body interface{}) error
  • Close() error

Method WriteRequest encodes and writes rpc.Request, which contains the sequential number and service/method name, and the argument (as body).

Method ReadResponseHeader reads and decodes rpc.Response, which contains the sequential number, service/method name and a possible error.

Method ReadResponseBody reads and decodes the reply.
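
Because rpc.ClientCodec is an interface, a codec can also wrap another one. The following sketch embeds an existing codec and overrides only WriteRequest to count outgoing requests. The names countingCodec and countRequests, the Arith service and the Args type are all invented for this example, and the JSON codec is used underneath merely because its constructor is exported.

```go
package main

import (
	"fmt"
	"net"
	"net/rpc"
	"net/rpc/jsonrpc"
)

// Args and Arith form a hypothetical service used only for this demo.
type Args struct{ A, B int }

type Arith struct{}

// Multiply is the RPC method; it stores A*B in reply.
func (Arith) Multiply(args Args, reply *int) error {
	*reply = args.A * args.B
	return nil
}

// countingCodec wraps another rpc.ClientCodec. The embedded codec supplies
// ReadResponseHeader, ReadResponseBody and Close unchanged.
type countingCodec struct {
	rpc.ClientCodec
	requests int
}

// WriteRequest counts the request and then delegates to the wrapped codec.
func (c *countingCodec) WriteRequest(r *rpc.Request, body interface{}) error {
	c.requests++
	return c.ClientCodec.WriteRequest(r, body)
}

// countRequests makes n synchronous calls and returns how many requests
// went through the codec.
func countRequests(n int) (int, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(Arith)); err != nil {
		return 0, err
	}
	clientConn, serverConn := net.Pipe()
	go srv.ServeCodec(jsonrpc.NewServerCodec(serverConn))

	codec := &countingCodec{ClientCodec: jsonrpc.NewClientCodec(clientConn)}
	client := rpc.NewClientWithCodec(codec)
	defer client.Close()

	for i := 0; i < n; i++ {
		var product int
		if err := client.Call("Arith.Multiply", Args{i, i}, &product); err != nil {
			return 0, err
		}
	}
	return codec.requests, nil
}

func main() {
	fmt.Println(countRequests(3))
}
```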