2012-08-12

Java Performance Tuning

* What I found in Performance Tuning
** BETTER MACHINE
- "We need more CPU cores. We need more memory; the 96 GB currently on our server is not enough. We need SSDs."

** BETTER ALGORITHM
- Use an array instead of a HashMap
  || - ASN.1 parsing
  || - In one system every component has a UUID, but dispatching calls by UUID is NOT efficient.
  ||   So we assigned each component a numeric ID for fast dispatch.
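
The numeric-ID dispatch described above could be sketched as follows. The names `ComponentRegistry`, `Component`, `register` and `dispatch` are hypothetical; the original system is not shown in these notes, so this only illustrates the idea of resolving the UUID once and indexing an array afterwards.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical registry: UUIDs are resolved once, calls are dispatched by array index. */
final class ComponentRegistry {
    interface Component {
        String handle(String message);
    }

    private final Map<UUID, Integer> idByUuid = new HashMap<>();
    private Component[] components = new Component[16];
    private int size;

    /** Slow path, done once per component: assigns the next dense numeric ID. */
    int register(UUID uuid, Component component) {
        Integer existing = idByUuid.get(uuid);
        if (existing != null) {
            return existing;
        }
        if (size == components.length) {
            components = Arrays.copyOf(components, size * 2);
        }
        components[size] = component;
        idByUuid.put(uuid, size);
        return size++;
    }

    /** Fast path: a plain array index, with no hashing and no equals() calls. */
    String dispatch(int id, String message) {
        return components[id].handle(message);
    }
}
```

Callers keep the `int` returned by `register` and pass it on every call, so the HashMap is only touched during setup.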













** CHOOSE PROPER INITIAL SIZE
- StringBuilder
- HashMap && ConcurrentHashMap && ArrayList ...

- YOUNG & OLD generation's size
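
For the StringBuilder and collection cases listed above, a minimal sketch of pre-sizing might look like this. The class `Presizing`, the helper `hashMapCapacityFor` and the "7 characters per entry" estimate are assumptions made for illustration; the 0.75 factor is HashMap's documented default load factor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Presizing {
    /** HashMap resizes once size exceeds capacity * loadFactor (0.75 by default). */
    static int hashMapCapacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    static String joinIds(int count) {
        // Assumed estimate of about 7 characters per entry, so the buffer is allocated once.
        StringBuilder sb = new StringBuilder(count * 7);
        for (int i = 0; i < count; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int expected = 10_000;
        Map<Integer, String> byId = new HashMap<>(hashMapCapacityFor(expected));
        List<String> names = new ArrayList<>(expected);
        for (int i = 0; i < expected; i++) {
            byId.put(i, "n" + i);   // no rehashing while filling
            names.add("n" + i);     // no array copying while filling
        }
        System.out.println(byId.size() + " " + names.size() + " " + joinIds(3));
    }
}
```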
  















+ GC in JAVA



 GC for Eden Space:

GC for Old Generation:



 

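+ HotSpot VM options:
    -Xmn<size>          //Size of the young generation.
    -Xms256m -Xmx512m   ||Initial and maximum heap size. Set both to the same value (e.g. -Xms512m -Xmx512m) to avoid spending CPU cycles growing the heap.
    -XX:NewRatio=n //Ratio of old/new generation sizes. The default value is 2.
    -XX:SurvivorRatio=n //Ratio of eden/survivor space size. The default value is 8.
    -XX:MaxTenuringThreshold=n //Maximum number of young collections an object survives before it is promoted to the old generation.
    -XX:PermSize -XX:MaxPermSize //Initial and maximum size of the permanent generation.

    -XX:+UseConcMarkSweepGC
    -XX:ParallelGCThreads=n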

+ If most of the application's data is short-lived, enlarge the YOUNG generation;
    otherwise, enlarge the OLD generation.
    See the HotSpot VM documentation for more GC options.
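
Before resizing the generations it helps to see how large they currently are. The sketch below uses the standard `java.lang.management` API to list the heap memory pools of the running JVM; the class name `HeapPools` is made up, and the pool names printed depend on the collector in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.ArrayList;
import java.util.List;

final class HeapPools {
    /** One line per heap pool (for example eden, survivor, old generation). */
    static List<String> describeHeapPools() {
        List<String> lines = new ArrayList<>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when the maximum is undefined.
            String max = usage.getMax() < 0 ? "undefined" : (usage.getMax() / 1024) + "K";
            lines.add(String.format("%-28s used=%dK committed=%dK max=%s",
                    pool.getName(), usage.getUsed() / 1024, usage.getCommitted() / 1024, max));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : describeHeapPools()) {
            System.out.println(line);
        }
    }
}
```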












+ About Java GC collectors
    -XX:+UseConcMarkSweepGC

+ SoftReference, WeakReference
  + Use case of PhantomReference
  + Apple has a new compile-time technique that avoids the performance problems caused by GC: Automatic Reference Counting
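
As a rough illustration of the reference types named above, the following sketch builds a tiny cache on `SoftReference`, whose entries the GC may clear under memory pressure, and shows a `WeakReference` next to it. `SoftCache` is a hypothetical class written for this note and is not thread-safe.

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cache whose values the GC is allowed to reclaim when memory runs low. */
final class SoftCache<K, V> {
    private final Map<K, SoftReference<V>> entries = new HashMap<>();

    void put(K key, V value) {
        entries.put(key, new SoftReference<>(value));
    }

    /** Returns null if the key is unknown or the value has been reclaimed. */
    V get(K key) {
        SoftReference<V> ref = entries.get(key);
        if (ref == null) {
            return null;
        }
        V value = ref.get();
        if (value == null) {
            entries.remove(key); // the referent was cleared by the GC
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1024];
        WeakReference<byte[]> weak = new WeakReference<>(payload);
        SoftCache<String, byte[]> cache = new SoftCache<>();
        cache.put("report", payload);
        // While payload is strongly reachable, neither reference can be cleared.
        System.out.println(weak.get() == payload);
        System.out.println(cache.get("report") == payload);
    }
}
```

A weak referent becomes collectable as soon as no strong reference remains, while a soft referent is normally kept until the JVM is short of memory, which is why soft references suit caches.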
   
















** LOWER THE MEMORY CONSUMPTION
- String's memory structure in JVM
  - structure of Object in JVM
    + a normal object requires 8 bytes of "housekeeping" space (the object header) on a 32-bit JVM;
        OpenJDK6: src/hotspot/src/share/vm/oops/markOop.hpp
        //  32 bits:
        //  --------
        //             hash:25 ------------>| age:4    biased_lock:1 lock:2 (normal object)
        //             JavaThread*:23 epoch:2 age:4    biased_lock:1 lock:2 (biased object)
        //             size:32 ------------------------------------------>| (CMS free block)
        //             PromotedObject*:29 ---------->| promo_bits:3 ----->| (CMS promoted object)
                
    + arrays require 12 bytes (the same as a normal object, plus 4 bytes for the array length).















- structure of String Object
    
    public final class String
    8        |String's object header
    4        |private final char[] value;------------------->,
    4        |private final int offset;                      |
    4        |private final int count;                       |
    4        |private int hash;                              |
                                                             |
    8        |Char Array's object header  |<-----------------'   
    4        |array length                |
    N*2      |bytes of N characters       |
    P        |bytes of padding to 8n      |
   
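    So an empty String uses 40 bytes on a 32-bit JVM: 24 for the String object plus 16 for the empty char array (12 bytes padded to 16).
    On a 64-bit JVM the object header takes 16 bytes (without compressed oops).

    ||Consider an AnsiString backed by byte[] rather than char[].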
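
The arithmetic in the layout above can be written down as a small calculation. This is only a sketch under the assumptions stated in the table: a 32-bit JDK 6 style layout with an 8-byte header, four 4-byte fields, and a String that owns its char array. The class and method names are invented for illustration.

```java
final class StringFootprint {
    /** Rounds up to the next multiple of 8, the object alignment assumed in the table. */
    static long alignTo8(long bytes) {
        return (bytes + 7) & ~7L;
    }

    /** Estimated bytes used by a String of the given length on a 32-bit JVM. */
    static long estimate32(int chars) {
        long stringObject = alignTo8(8 + 4 + 4 + 4 + 4); // header + value + offset + count + hash
        long charArray = alignTo8(8 + 4 + 2L * chars);   // header + length + 2 bytes per char
        return stringObject + charArray;
    }

    public static void main(String[] args) {
        System.out.println(estimate32(0));  // 40
        System.out.println(estimate32(10)); // 56
    }
}
```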

 
- String's substring implementation
    ||In JDK 6, substring() shares the parent's char[] (via offset and count), so a short substring
    ||can keep a large array alive. That is why we create a new String after substring:
    ||new String(str.substring(5, 9));

  - StringBuilder/StringBuffer use less memory than repeated String concatenation.
  - ||If the string contents do not matter for computation or caching, you can create your own String
    || constant pool (a HashMap) that maps frequently used Strings to Integer values.
    || When the numeric form is no longer needed, convert it back to the String.
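
A home-made String constant pool of the kind described above might look like the following sketch. `StringIdPool`, `toId` and `fromId` are hypothetical names, and the class is deliberately not thread-safe to keep it short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical pool that maps frequently used Strings to small int IDs and back. */
final class StringIdPool {
    private final Map<String, Integer> idByString = new HashMap<>();
    private final List<String> stringById = new ArrayList<>();

    /** Returns the ID for s, assigning the next free ID on first use. */
    int toId(String s) {
        Integer id = idByString.get(s);
        if (id == null) {
            id = stringById.size();
            stringById.add(s);
            idByString.put(s, id);
        }
        return id;
    }

    /** Converts an ID back to the original String, e.g. for output. */
    String fromId(int id) {
        return stringById.get(id);
    }
}
```

Records can then store an `int` (4 bytes) instead of a reference to a String object, and equal strings share a single pooled instance.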












- memory compression
  - snappy-java
  - off-heap memory
  - BigMemory
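
To make the off-heap idea concrete, here is a minimal sketch that stores `long` values in a direct `ByteBuffer`, whose contents live outside the garbage-collected heap. It uses only the JDK and does not represent how snappy-java or BigMemory work internally; `OffHeapLongArray` is an invented name.

```java
import java.nio.ByteBuffer;

/** Hypothetical fixed-size long array backed by native (off-heap) memory. */
final class OffHeapLongArray {
    private final ByteBuffer buffer;

    OffHeapLongArray(int length) {
        // Direct buffers are allocated outside the Java heap, so the GC does not scan their contents.
        buffer = ByteBuffer.allocateDirect(length * Long.BYTES);
    }

    void set(int index, long value) {
        buffer.putLong(index * Long.BYTES, value);
    }

    long get(int index) {
        return buffer.getLong(index * Long.BYTES);
    }

    int length() {
        return buffer.capacity() / Long.BYTES;
    }
}
```

The amount of direct memory a JVM may allocate is limited separately from the heap (HotSpot has the `-XX:MaxDirectMemorySize` option), so large off-heap stores need that limit sized accordingly.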

- Story about not using AtomicInteger
  - AtomicInteger performs better than synchronized increments (compare InterlockedIncrement, __sync_fetch_and_add)
  - but it uses more memory
  - IN THAT STORY: it could be avoided through hash re-dispatch
- AtomicReference
  - Java volatile reference vs. AtomicReference (compareAndSet() is used in queue implementations)
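
The two ways of incrementing a shared counter mentioned above can be compared with a sketch like this one. The class and method names are hypothetical, and the code only demonstrates that both variants are correct; which one is faster depends on contention, hardware and JVM version, so measure before drawing conclusions.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class Counters {
    /** Lock-based counter: every increment acquires the object's monitor. */
    static final class SynchronizedCounter {
        private int value;
        synchronized void increment() { value++; }
        synchronized int get() { return value; }
    }

    /** Lock-free variant: incrementAndGet() is implemented with an atomic CPU instruction. */
    static int countWithAtomic(int threads, int perThread) {
        AtomicInteger counter = new AtomicInteger();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) counter.incrementAndGet();
        });
        return counter.get();
    }

    static int countWithLock(int threads, int perThread) {
        SynchronizedCounter counter = new SynchronizedCounter();
        runAll(threads, () -> {
            for (int i = 0; i < perThread; i++) counter.increment();
        });
        return counter.get();
    }

    /** Starts the given number of threads on the same task and waits for all of them. */
    private static void runAll(int threads, Runnable task) {
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(task, "counter-" + i);
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

The memory remark in the notes refers to the fact that each AtomicInteger is a separate object with its own header, whereas a plain `int` field adds only 4 bytes to its owner.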


























** TOOLS
*** jps
    ||Use jps to find the correct JVM process ID.

   
C:\Documents and Settings\Administrator>jps -lmVv
   3672 org.nasa.marsrovers.Main -agentlib:jdwp=transport=dt_socket,suspend=y,address=localhost:4176 -Dfile.encoding=UTF-8 -Xbootclasspath:C:\Program Files\Java\jre6\lib\resources.jar;C:\Program Files\Java\jre6\lib\rt.jar;C:\Program Files\Java\jre6\lib\jsse.jar;C:\Program Files\Java\jre6\lib\jce.jar;C:\Program Files\Java\jre6\lib\charsets.jar
   3380 sun.tools.jps.Jps -lmVv -Denv.class.path=.;C:\Program Files\Java\jdk1.6.0_22\jre\lib;C:\Program Files\Java\jdk1.6.0_22\lib; -Dapplication.home=C:\Program Files\Java\jdk1.6.0_22 -Xms8m
   552  -Dosgi.requiredJavaVersion=1.5 -Xms40m -Xmx384m -XX:MaxPermSize=256m















*** jstack
    ||Shows the current stack trace of every thread.
    ||Thread names appear in the stack trace, which is why threads and thread pools should be given names.

    + look for patterns in the stacks to find bugs
       ||Deadlock in a web server
          It took us three days to reproduce. If we had grabbed a thread dump before restarting the server,
          the bug would have been very easy to find. The same applies to .NET and C++.
          Before restarting a server, the first thing to do is collect more
          information (dumps, memory usage, CPU usage).

       ||HashMap infinite loop

       ||Most threads blocked on the LOG message queue's put()
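
Since thread names show up in every dump, giving pool threads meaningful names is cheap insurance. The sketch below shows one possible way to do it with a custom `ThreadFactory`; the class name `NamedThreadFactory` and the prefix "log-writer" are made up for the example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical factory that names threads "prefix-1", "prefix-2", ... */
final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger next = new AtomicInteger(1);

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        return new Thread(task, prefix + "-" + next.getAndIncrement());
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2, new NamedThreadFactory("log-writer"));
        // The name printed here is the one jstack shows for this thread.
        pool.submit(() -> System.out.println(Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```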























 $ jstack -l 3672
        
        2012-08-12 22:48:17
        Full thread dump Java HotSpot(TM) Client VM (17.1-b03 mixed mode):
         
        "Low Memory Detector" daemon prio=6 tid=0x16bb2800 nid=0x1314 runnable [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "CompilerThread0" daemon prio=10 tid=0x16baf400 nid=0xbd4 waiting on condition [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "JDWP Command Reader" daemon prio=6 tid=0x16bad000 nid=0xdc4 runnable [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "JDWP Event Helper Thread" daemon prio=6 tid=0x16bab000 nid=0x16c8 runnable [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "JDWP Transport Listener: dt_socket" daemon prio=6 tid=0x16ba8c00 nid=0x11d4 runnable [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "Attach Listener" daemon prio=10 tid=0x16b99000 nid=0x168c waiting on condition [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "Signal Dispatcher" daemon prio=10 tid=0x16bb3400 nid=0x1320 runnable [0x00000000]
           java.lang.Thread.State: RUNNABLE
         
           Locked ownable synchronizers:
         - None
         
        "Finalizer" daemon prio=8 tid=0x16b85000 nid=0x1678 in Object.wait() [0x16cff000]
           java.lang.Thread.State: WAITING (on object monitor)
         at java.lang.Object.wait(Native Method)
         - waiting on <0x029d0b28> (a java.lang.ref.ReferenceQueue$Lock)
         at java.lang.ref.ReferenceQueue.remove(Unknown Source)
         - locked <0x029d0b28> (a java.lang.ref.ReferenceQueue$Lock)
         at java.lang.ref.ReferenceQueue.remove(Unknown Source)
         at java.lang.ref.Finalizer$FinalizerThread.run(Unknown Source)
         
           Locked ownable synchronizers:
         - None
         
        "Reference Handler" daemon prio=10 tid=0x16b83c00 nid=0x13ec in Object.wait() [0x16caf000]
           java.lang.Thread.State: WAITING (on object monitor)
         at java.lang.Object.wait(Native Method)
         - waiting on <0x029d0a28> (a java.lang.ref.Reference$Lock)
         at java.lang.Object.wait(Object.java:485)
         at java.lang.ref.Reference$ReferenceHandler.run(Unknown Source)
         - locked <0x029d0a28> (a java.lang.ref.Reference$Lock)
         
           Locked ownable synchronizers:
         - None
         
        "main" prio=6 tid=0x00847000 nid=0x1464 runnable [0x0091f000]
           java.lang.Thread.State: RUNNABLE
         at java.io.FileInputStream.readBytes(Native Method)
         at java.io.FileInputStream.read(Unknown Source)
         at java.io.BufferedInputStream.read1(Unknown Source)
         at java.io.BufferedInputStream.read(Unknown Source)
         - locked <0x029e19f0> (a java.io.BufferedInputStream)
         at sun.nio.cs.StreamDecoder.readBytes(Unknown Source)
         at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
         at sun.nio.cs.StreamDecoder.read(Unknown Source)
         - locked <0x02b63a00> (a java.io.InputStreamReader)
         at java.io.InputStreamReader.read(Unknown Source)
         at java.io.BufferedReader.fill(Unknown Source)
         at java.io.BufferedReader.readLine(Unknown Source)
         - locked <0x02b63a00> (a java.io.InputStreamReader)
         at java.io.BufferedReader.readLine(Unknown Source)
         at org.nasa.marsrovers.simulator.Simulator.startUp(Simulator.java:53)
         at org.nasa.marsrovers.Main.main(Main.java:50)
         
           Locked ownable synchronizers:
         - None
         
        "VM Thread" prio=10 tid=0x16b81400 nid=0xef0 runnable 
         
        "VM Periodic Task Thread" prio=10 tid=0x16bc9000 nid=0x304 waiting on condition 
         
        JNI global references: 1508



*** jmap
**** show memory usage
        08:57 ~ $ jmap -heap 623
        Attaching to process ID 623, please wait...
        Debugger attached successfully.
        Server compiler detected.
        JVM version is 20.6-b01-414
        
        using parallel threads in the new generation.
        using thread-local object allocation.
        Concurrent Mark-Sweep GC
        
        Heap Configuration:
           MinHeapFreeRatio = 40
           MaxHeapFreeRatio = 70
           MaxHeapSize      = 838860800 (800.0MB)
           NewSize          = 21757952 (20.75MB)
           MaxNewSize       = 174456832 (166.375MB)
           OldSize          = 65404928 (62.375MB)
           NewRatio         = 7
           SurvivorRatio    = 8
           PermSize         = 21757952 (20.75MB)
           MaxPermSize      = 367001600 (350.0MB)
        
        Heap Usage:
        New Generation (Eden + 1 Survivor Space):
           capacity = 19595264 (18.6875MB)
           used     = 16274240 (15.52032470703125MB)
           free     = 3321024 (3.16717529296875MB)
           83.0519047867893% used
        Eden Space:
           capacity = 17432576 (16.625MB)
           used     = 15573992 (14.852516174316406MB)
           free     = 1858584 (1.7724838256835938MB)
           89.33844315378289% used
        From Space:
           capacity = 2162688 (2.0625MB)
           used     = 700248 (0.6678085327148438MB)
           free     = 1462440 (1.3946914672851562MB)
           32.37859552556818% used
        To Space:
           capacity = 2162688 (2.0625MB)
           used     = 0 (0.0MB)
           free     = 2162688 (2.0625MB)
           0.0% used
        concurrent mark-sweep generation:
           capacity = 154337280 (147.1875MB)
           used     = 120364648 (114.7886734008789MB)
           free     = 33972632 (32.398826599121094MB)
           77.98805836153132% used
        Perm Generation:
           capacity = 176562176 (168.3828125MB)
           used     = 134844944 (128.59815979003906MB)
           free     = 41717232 (39.78465270996094MB)
           76.37249781062961% used







**** show memory usage by object type
        09:02 ~ $ jmap -histo 623 | head -n 20
        
         num     #instances         #bytes  class name
        ----------------------------------------------
           1:         57975       34551008  [B
           2:        304951       33637464  [C
           3:        166951       23748152  <constMethodKlass>
           4:        166951       22721672  <methodKlass>
           5:         21247       21762832  <constantPoolKlass>
           6:        279214       18444400  <symbolKlass>
           7:         21247       17749960  <instanceKlassKlass>
           8:         20092       13129152  <constantPoolCacheKlass>
           9:        304840        9754880  java.lang.String
          10:         53366        4497624  [Ljava.lang.Object;
          11:         15038        4360504  [I
          12:        122304        3913728  java.util.HashMap$Entry
          13:         22426        2332304  java.lang.Class
          14:          4146        2094728  <methodDataKlass>
          15:         32534        1717344  [S
          16:         12637        1676168  [Ljava.util.HashMap$Entry;
          17:         35117        1544960  [[I

**** get a JVM heap dump
        $ jmap -dump:file=/tmp/demo.map 91440
        Dumping heap to /private/tmp/demo.map ...
        Heap dump file created
        or start the JVM with these options to dump the heap automatically on OutOfMemoryError:
      -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:/temp/oom.hprof
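
A dump can also be triggered from inside the application, which is handy just before a planned restart. The following is a sketch that relies on the HotSpot-specific `com.sun.management.HotSpotDiagnosticMXBean`; it will not work on JVMs that do not provide that bean, and the wrapper class `HeapDumper` is hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {
    /**
     * Writes a heap dump to the given file, which must not exist yet.
     * With live = true only reachable objects are dumped.
     */
    static void dump(String file, boolean live) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            bean.dumpHeap(file, live);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Recent JDKs require the file name to end in `.hprof`, so it is safest to always use that extension.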

**** analyze heap dump
     
     +   $ jhat /private/tmp/demo.map 
        Reading from /private/tmp/demo.map...
        Started HTTP server on port 7000
      

     +  As with "!dumpheap -type Exception" in WinDbg + SOS, you can use OQL to find
        more useful information in the dumped heap, for example:
        select file.path.toString() from java.io.File file
**** use Eclipse MAT (http://www.eclipse.org/mat) to analyze the dump file
- articles about MAT
     











**** Monitor GC status of JVM
  - jstat

     $ jstat -gcutil 21891 250 7
     S0     S1     E      O      P     YGC    YGCT    FGC    FGCT     GCT
    12.44   0.00  27.20   9.49  96.70    78    0.176     5    0.495    0.672
    12.44   0.00  62.16   9.49  96.70    78    0.176     5    0.495    0.672
    12.44   0.00  83.97   9.49  96.70    78    0.176     5    0.495    0.672
     0.00   7.74   0.00   9.51  96.70    79    0.177     5    0.495    0.673
     0.00   7.74  23.37   9.51  96.70    79    0.177     5    0.495    0.673
     0.00   7.74  43.82   9.51  96.70    79    0.177     5    0.495    0.673
     0.00   7.74  58.11   9.51  96.71    79    0.177     5    0.495    0.673
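
The collection counts and times that jstat prints are also available in-process through the standard `GarbageCollectorMXBean` API, which is convenient for logging them from the application itself. A minimal sketch, with the class name `GcStats` invented for the example (collector names vary by JVM and GC settings):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcStats {
    /** Sum of collections over all collectors; undefined counts (-1) are ignored. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** One line per collector: name, number of collections, accumulated time in ms. */
    static String report() {
        StringBuilder sb = new StringBuilder();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime())
              .append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(report());
    }
}
```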







































  - VisualVM

    When CPU usage is high, use VisualVM to find the longest-running methods.
    When CPU usage is low, take a few stack samples to see where threads are blocked:
    queue.put()? queue.take()?

  - nmon: a great tool for monitoring CPU, memory, network, disks...


















** OTHERS
*** JAVA AS SCRIPT
  aim:     write business calculations/logic as scripts
  problem: calling Groovy/Python from Java was too slow.

  - write the business calculation/logic to a file XXX.java, then compile it and load it with a customized class loader.
  - the Java class can be reloaded after the business calculation/logic changes, without restarting the Java instance.
  - use scripts to trace data
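
The compile-and-load approach described above can be sketched with the JDK's own compiler API (`javax.tools`). Everything specific here is an assumption for illustration: the class name `ScriptLoader`, the convention that a script exposes a `public static long apply(long)` method, and the use of a temporary directory. It needs a JDK at runtime, because a plain JRE has no system compiler.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

final class ScriptLoader {
    /** Compiles source as public class className and calls its static long apply(long). */
    static long compileAndRun(String className, String source, long arg) {
        try {
            Path dir = Files.createTempDirectory("script");
            Path file = dir.resolve(className + ".java");
            Files.write(file, source.getBytes(StandardCharsets.UTF_8));

            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null without a JDK
            if (compiler == null) {
                throw new IllegalStateException("a JDK is required to compile scripts");
            }
            int status = compiler.run(null, null, null, "-d", dir.toString(), file.toString());
            if (status != 0) {
                throw new IllegalStateException("compilation failed for " + className);
            }

            // A fresh class loader per compilation is what allows reloading without a restart.
            try (URLClassLoader loader = new URLClassLoader(new URL[] { dir.toUri().toURL() })) {
                Class<?> scriptClass = loader.loadClass(className);
                Method apply = scriptClass.getMethod("apply", long.class);
                return (Long) apply.invoke(null, arg);
            }
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the compiled script is ordinary bytecode, it runs at normal JIT-compiled speed, which is the point of choosing Java over an embedded scripting language here.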




































*** NETWORK
  - ||CPU balance for network interfaces.
    On some multi-CPU machines, if not properly configured,
    all network interfaces' interrupts go to CPU0 (as shown below).
    Network performance is then limited by the capacity of CPU0.
    The fix is to spread the interrupts across CPUs (IRQ affinity, e.g. /proc/irq/<N>/smp_affinity or irqbalance).

      $ cat /proc/interrupts
              CPU0       CPU1       CPU2       CPU3
     65:      20041      0          0          0      IR-PCI-MSI-edge  eth0-tx-0
     66:      20232      0          0          0      IR-PCI-MSI-edge  eth0-tx-1
     67:      20105      0          0          0      IR-PCI-MSI-edge  eth0-tx-2
     68:      20423      0          0          0      IR-PCI-MSI-edge  eth0-tx-3
     69:      21036      0          0          0      IR-PCI-MSI-edge  eth0-rx-0
     70:      20201      0          0          0      IR-PCI-MSI-edge  eth0-rx-1
     71:      20587      0          0          0      IR-PCI-MSI-edge  eth0-rx-2
     72:      20853      0          0          0      IR-PCI-MSI-edge  eth0-rx-3
























- set_thread_affinity
    ||If your system's performance is not stable, you can try set_thread_affinity.
    Restricting a process to run on a single CPU also avoids the performance cost
    caused by the cache invalidation that occurs when a process ceases to execute
    on one CPU and then recommences execution on a different CPU.

    int sched_setaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask);
- send several small packets as one big packet
- PF_RING
- Google Protocol Buffers
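
Batching several small messages into one large packet can be sketched as follows. The framing used here, a 4-byte length prefix in front of each UTF-8 message, is an assumption chosen for the example and is not the wire format of Protocol Buffers or of any system mentioned in these notes; `MessageBatch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class MessageBatch {
    /** Packs messages as [int length][bytes]... into one buffer, ready for a single write. */
    static ByteBuffer pack(List<String> messages) {
        List<byte[]> encoded = new ArrayList<>();
        int size = 0;
        for (String message : messages) {
            byte[] bytes = message.getBytes(StandardCharsets.UTF_8);
            encoded.add(bytes);
            size += Integer.BYTES + bytes.length;
        }
        ByteBuffer buffer = ByteBuffer.allocate(size);
        for (byte[] bytes : encoded) {
            buffer.putInt(bytes.length).put(bytes);
        }
        buffer.flip(); // switch from writing to reading
        return buffer;
    }

    /** Reverses pack(): reads length-prefixed messages until the buffer is exhausted. */
    static List<String> unpack(ByteBuffer buffer) {
        List<String> messages = new ArrayList<>();
        while (buffer.remaining() >= Integer.BYTES) {
            byte[] bytes = new byte[buffer.getInt()];
            buffer.get(bytes);
            messages.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return messages;
    }
}
```

One write of the packed buffer replaces one system call and one packet header per message, at the cost of a little extra latency while the batch fills.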

*** thread local storage to avoid allocating new objects
    Java: ThreadLocal
    gcc:  __thread int i;
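
In Java the idea might be applied like this: each thread keeps one reusable StringBuilder in a `ThreadLocal` instead of allocating a new one per call. The class `ReusableBuffers`, the method `formatOrder` and the initial capacity of 256 are illustrative assumptions.

```java
final class ReusableBuffers {
    // One buffer per thread, created lazily on first use by that thread.
    private static final ThreadLocal<StringBuilder> BUFFER =
            ThreadLocal.withInitial(() -> new StringBuilder(256));

    static String formatOrder(String symbol, int quantity) {
        StringBuilder sb = BUFFER.get();
        sb.setLength(0); // reuse the thread's buffer instead of allocating a new one
        return sb.append(symbol).append(':').append(quantity).toString();
    }
}
```

The final `toString()` still allocates the result String; the saving is the intermediate builder and its backing array, and no locking is needed because the buffer is never shared between threads.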

 - use arrays where possible
 - use hash dispatch to avoid locking
 - log sampling: if a log message is not important, use queue.offer() so the caller never blocks
 - keep the thread count low
 - use a message loop instead of timers
 - throw fewer exceptions
 - use a double-buffered queue
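
The log-sampling tip above can be sketched with a bounded queue and `offer()`, which returns immediately instead of blocking like `put()`. `DroppingLogQueue` and its methods are hypothetical names, and counting the dropped messages is an addition made for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical log queue that drops messages rather than blocking the caller. */
final class DroppingLogQueue {
    private final BlockingQueue<String> queue;
    private final AtomicLong dropped = new AtomicLong();

    DroppingLogQueue(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity);
    }

    /** Never blocks: returns false and counts the drop when the queue is full. */
    boolean log(String message) {
        boolean accepted = queue.offer(message);
        if (!accepted) {
            dropped.incrementAndGet();
        }
        return accepted;
    }

    /** Called by the writer thread; returns null when no message is waiting. */
    String poll() {
        return queue.poll();
    }

    long droppedCount() {
        return dropped.get();
    }
}
```

With `put()`, a slow log writer stalls every business thread that logs, which is the "most threads blocked on the LOG message queue's put()" pattern that shows up in thread dumps.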




** THE END










