Wednesday, November 19, 2008

kernel upgrade to 2.6.27 for mips/arm platform

I encountered some other interesting issues while upgrading the kernel from 2.6.23 to 2.6.27 for our MIPS32 platform.

1. The timer. In 2.6.24 and later, the default MIPS timer has been separated from the old timer code. Because our architecture uses the default cp0 compare-based timer, we need to enable the r4k timer to get the timer working. I spent a couple of days understanding this change. The start_kernel function was able to proceed once the proper timer was enabled. However, we also have a timer block in our chip, and we can set it to a certain frequency as an independent timer source. Later in my debugging, I enabled the timer block and used our own timer source. That works as well.

2. The cache should be disabled when the kernel starts and re-enabled later on. In 2.6.23, the cache was enabled by default. However, in 2.6.27 (or some version later than 2.6.23), the cache policy became configurable through the kernel option "cca=", and it is disabled by default. This change really hurt me. Because there are so many changes between the 23 and 27 kernels, it was almost impossible for me to notice this one at first.

What I observed first was the slowness of the system. The BogoMips dropped from about 273 to 3, which is unbelievable. At the beginning I doubted the correctness of the timer function. I scrutinized the code and studied the new timer implementation thoroughly. I even implemented our own timer using the timer block in our chip. None of that helped. The system was able to boot to busybox, but it was really slow.

By accident, I tried our performance counter program to measure the performance. It reported 0 cache hits, which means that no cache was enabled. I checked our private i/d cache registers and they seemed to be enabled. However, I forgot to check the cp0 register settings of the MIPS core, where there is another setting that enables or disables the cache policy.

I used a very stupid and old method to pinpoint the problem. I added a NOP test to both the 23 and 27 kernels. In the 23 kernel, once the cache is enabled, the NOP test gives a much lower CPI (cycles per instruction); before that, the CPI is high. In the 27 kernel, the CPI never changes. I tried to find the exact point where the CPI drops in the 23 kernel and checked the corresponding code in the 27 kernel. I finally found that the default cache policy was disabled in the 27 kernel, while it is enabled in the 23 kernel. After adding the "cca=3" kernel command line option, everything went back to normal: the BogoMips recovered and the kernel boots properly.


3. EXPORT_SYMBOL versus EXPORT_SYMBOL_GPL. If your driver or kernel module uses symbols exported the latter way, it must be GPL'ed. This can cause a problem for us, because we don't want to open source all our kernel modules, especially the wlan driver. We deliver binary kernel modules for our wlan drivers.
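
For item 2 above, the cp0 setting in question is, as far as I understand the MIPS32 architecture, the K0 field in the low three bits of the Config register, which selects the cache policy for kseg0 (2 means uncached, 3 means cacheable). The following is only a minimal user-space sketch of how that field is decoded and changed. The function names are made up for illustration, and a plain integer stands in for the real register, so this is not the kernel's own code.

```c
#include <stdint.h>

/* K0 field of the MIPS32 cp0 Config register: bits 2:0. */
#define CONFIG_K0_MASK 0x7u
#define CCA_UNCACHED   2u  /* kseg0 accesses bypass the cache */
#define CCA_CACHEABLE  3u  /* kseg0 accesses go through the cache */

/* Hypothetical helper: extract the cache policy from a Config value. */
static inline uint32_t config_get_k0(uint32_t config)
{
    return config & CONFIG_K0_MASK;
}

/* Hypothetical helper: return a Config value with a new cache policy,
 * leaving all other bits untouched. */
static inline uint32_t config_set_k0(uint32_t config, uint32_t cca)
{
    return (config & ~CONFIG_K0_MASK) | (cca & CONFIG_K0_MASK);
}

/* Hypothetical helper: true when kseg0 is running uncached. */
static inline int kseg0_is_uncached(uint32_t config)
{
    return config_get_k0(config) == CCA_UNCACHED;
}
```

With a made-up Config value of 0x80000482, kseg0_is_uncached() reports true. Passing CCA_CACHEABLE to config_set_k0() changes only the low three bits, which is the same effect the "cca=3" option had for us.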

Tuesday, November 18, 2008

kernel upgrade to 2.6.27

I have just fixed a network driver bug that showed up when I upgraded the kernel from 2.6.23 to 2.6.27 for our MIPS platform. It took a couple of weeks. The problem first appeared when NAPI was used in the network driver and I changed the net_poll function accordingly. Then I got a kernel panic with a memory access failure. After a long time debugging, I found that something goes wrong when the driver tries to figure out the address of the skb from the skb->data structure. This is weird, because the same code was used in both the 2.6.23 and 2.6.27 kernels.

The original author of the network driver gave me a hint: he had experienced a similar problem when he was creating a network driver for our next-generation chip, based on the old driver. He mentioned that the original driver was confused about physical and virtual addresses when accessing DMA'ed memory. This was quite helpful. I spent a whole day digging into the issue and studying the new driver for the next-generation chip. After replacing the memory allocation function for the skb buffer, I finally had the proper method to access the memory.

I have learned about physical, virtual, and bus addresses when accessing memory in the kernel. The principle is simple, as stated by Linus: "use virtual address when accessing memory in kernel, and use bus address when the memory was given to device". On some architectures, the bus address is identical to the physical address. Never use a physical address directly. The functions phys_to_virt, virt_to_bus, bus_to_virt, and virt_to_phys are all helper functions for these conversions.
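
To make the distinction concrete, here is a small user-space model of the address arithmetic on a MIPS32 core, where the unmapped kseg0 (cached) and kseg1 (uncached) segments both alias the low 512 MB of physical memory. It is only a sketch under that assumption: the function names and the bus offset parameter are hypothetical, and they are not the kernel helpers named above.

```c
#include <stdint.h>

#define KSEG0_BASE 0x80000000u  /* unmapped, cached segment */
#define KSEG1_BASE 0xA0000000u  /* unmapped, uncached segment */
#define PHYS_MASK  0x1FFFFFFFu  /* low 512 MB of physical memory */

/* Hypothetical helper: physical address behind a kseg0/kseg1 address. */
static inline uint32_t kseg_to_phys(uint32_t vaddr)
{
    return vaddr & PHYS_MASK;
}

/* Hypothetical helper: cached kernel virtual address for a physical one. */
static inline uint32_t phys_to_kseg0(uint32_t paddr)
{
    return (paddr & PHYS_MASK) | KSEG0_BASE;
}

/* Hypothetical helper: uncached kernel virtual address for a physical one. */
static inline uint32_t phys_to_kseg1(uint32_t paddr)
{
    return (paddr & PHYS_MASK) | KSEG1_BASE;
}

/* Hypothetical helper: address as seen by a device on the bus.
 * bus_offset is an assumption; it is 0 where bus == physical. */
static inline uint32_t phys_to_bus(uint32_t paddr, uint32_t bus_offset)
{
    return paddr + bus_offset;
}
```

In this model the CPU dereferences the kseg0 or kseg1 address, while the value written into a device's DMA descriptor is the result of phys_to_bus(). Mixing the two up gives exactly the kind of bad memory access I hit in the driver.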

It seems that we still have a lot of bugs in our network driver. Apparently, our engineers do not yet have enough experience writing drivers for Linux. Most of their experience is with VxWorks, which has a flat memory model.