# String Deduplication

# 介绍

string deduplication即字符串去重是JDK在1.8开始提供的功能，目的是减少相同内容字符串的内存占用，相同内容是指equals为true的两个字符串。目前只有在使用G1收集器的情况下才能开启，默认不开启，开启方法为增加 -XX:+UseStringDeduplication启动参数。

# 实现机制

java.lang.String字符串对象的数据保存在内部的byte[] value即一个名为value的byte数组中

public final class String {
    private final byte[] value;
}

1
2
3

在一次gc之后，回收的heap区域中的活对象(live objects)都会被访问一遍，对于每个string对象，我们都会查看是否能够进行deduplication。判断规则是根据对象的age，大于一定年龄才进行deduplication，因为如果string对象是一个临时对象，进行deduplication没有太大价值。

如果确认一个字符串对象可以进行deduplication，则会把对象引用插入到一个队列中。

会有一个deduplication线程来处理这个队列，从队列中取出字符串对象，然后判断尝试对字符串中的byte数组进行去重。

内部有一个hashtable来追踪string对象的byte数组。

在去重的时候，会到hashtable中查找是否有相同的byte数组，如果有，会把后面的字符串的byte数组换成已有的相同的byte数组并释放掉自己原有的byte数组引用好让这块内存能够通过GC释放。如果在hashtable中没有找到相同的byte数组，则会把byte数组写入到hashtable中，hashtable对于byte数组的引用是弱引用避免内存泄露。

# deduplicate对象的选择条件。

对于young gc和mixed gc，判断一个对象是否要进行deduplicate的条件是

开启了-XX:+UseStringDeduplication
对象类型是String类型，判断对象的klass和_string_klass_or_null相同，如果没有开启-XX:+UseStringDeduplication，_string_klass_or_null是null所以这样都不会进行deduplicate
从young gen复制过来的对象
如果是复制到young gen(survivor)，则对象年龄即age需要等于-XX:StringDeduplicationAgeThreshold=3的配置值，默认是3；如果是复制到old gen（晋升的情况），则age需要小于StringDeduplicationAgeThreshold

最后的判断条件是什么目的呢？

如果对象晋升到old gen，则需要进行一次deduplicate，因为再不deduplicate就只能等待full gc时进行deduplicate了(mixed gc的时候即使选择了old gen的region也不会进行deduplicate)。判断age小于StringDeduplicationAgeThreshold是避免重复deduplicate，因为晋升到old gen的时候age不变，如果age已经大于等于StringDeduplicationAgeThreshold时了说明之前已经进行过deduplicate了。

// Candidate selection policy for young/mixed GC.
// If to is young then age should be the new (survivor's) age.
// if to is old then age should be the age of the copied from object.
static bool is_candidate_from_evacuation(const Klass* klass,
                                       G1HeapRegionAttr from,
                                       G1HeapRegionAttr to,
                                       uint age) {
return StringDedup::is_enabled_string(klass) &amp;&amp;
       from.is_young() &amp;&amp;
       (to.is_young() ?
        StringDedup::is_threshold_age(age) :
        StringDedup::is_below_threshold_age(age));
}

1
2
3
4
5
6
7
8
9
10
11
12
13

对于FullGC的情况，则判断开启了-XX:+UseStringDeduplication、对象是String类型并且在young gen中并且age小于StringDeduplicationAgeThreshold。

if (StringDedup::is_enabled() &amp;&amp;
  java_lang_String::is_instance_inlined(obj) &amp;&amp;
  G1StringDedup::is_candidate_from_mark(obj)) {
  _string_dedup_requests.add(obj);
}
bool G1StringDedup::is_candidate_from_mark(oop java_string) {
  // Candidate if string is being evacuated from young to old but has not
  // reached the deduplication age threshold, i.e. has not previously been a
  // candidate during its life in the young generation.
  return G1CollectedHeap::heap()->heap_region_containing(java_string)->is_young() &amp;&amp;
         StringDedup::is_below_threshold_age(java_string->age());
}

1
2
3
4
5
6
7
8
9
10
11
12

# deduplication queue

deduplication queue是GC线程和Deduplication线程之间通信的通道，GC线程在gc时，把可以deduplicate的String对象，放到队列中，Deduplicate线程不断从队列中获取待deduplicate的对象。每个GC线程都有自己的Requests对象，来减少线程间的冲突。

入队相关代码

在G1ParScanThreadState::do_copy_to_survivor_space方法中

// Check for deduplicating young Strings.
if (G1StringDedup::is_candidate_from_evacuation(klass,
                                                region_attr,
                                                dest_attr,
                                                age)) {
  // Record old; request adds a new weak reference, which reference
  // processing expects to refer to a from-space object.
  _string_dedup_requests.add(old);
}

1
2
3
4
5
6
7
8
9

# Deduplication Thread

deduplication线程是一个jvm内部的线程，和Java应用属于并发执行关系，通过jstack可以看到这个名为StringDedupProcessor的线程。

do_oop方法判断字符串是否为null、进行尝试去重、hashtable扩容、hashtable清理。

virtual void do_oop(oop* ref) {
    if (_processor->yield_or_continue(_joiner, Stat::Phase::process)) {
      oop java_string = NativeAccess&lt;ON_PHANTOM_OOP_REF>::oop_load(ref);
      release_ref(ref);
      // Dedup java_string, after checking for various reasons to skip it.
      if (java_string == nullptr) {
        // String became unreachable before we got a chance to process it.
        _cur_stat.inc_skipped_dead();
      } else if (java_lang_String::value(java_string) == nullptr) {
        // Request during String construction, before its value array has
        // been initialized.
        _cur_stat.inc_skipped_incomplete();
      } else {
        Table::deduplicate(java_string);
        if (Table::is_grow_needed()) {
          _cur_stat.report_process_pause();
          _processor->cleanup_table(_joiner, true /* grow_only */, false /* force */);
          _cur_stat.report_process_resume();
        }
      }
    }
  }
};

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23

去重的具体处理，在hashtable中查找和当前字符串的value(byte数组)内容相同的byte array，如果找到，并且不是同一个数组对象，则对当前的字符串替换value引用。如果没找到，则将当前的value插入到hashtable中。

void StringDedup::Table::deduplicate(oop java_string) {
  assert(java_lang_String::is_instance(java_string), "precondition");
  _cur_stat.inc_inspected();
  if ((StringTable::shared_entry_count() > 0) &amp;&amp;
      try_deduplicate_shared(java_string)) {
    return;                     // Done if deduplicated against shared StringTable.
  }
  typeArrayOop value = java_lang_String::value(java_string);
  uint hash_code = compute_hash(value);
  TableValue tv = find(value, hash_code);
  if (tv.is_empty()) {
    // Not in table.  Create a new table entry.
    install(value, hash_code);
  } else {
    _cur_stat.inc_known();
    typeArrayOop found = cast_from_oop&lt;typeArrayOop>(tv.resolve());
    assert(found != nullptr, "invariant");
    // Deduplicate if value array differs from what's in the table.
    if (found != value) {
      if (deduplicate_if_permitted(java_string, found)) {
        _cur_stat.inc_deduped(found->size() * HeapWordSize);
      } else {
        // If string marked deduplication_forbidden then we can't update its
        // value.  Instead, replace the array in the table with the new one,
        // as java_string is probably in the StringTable.  That makes it a
        // good target for future deduplications as it is probably intended
        // to live for some time.
        tv.replace(value);
        _cur_stat.inc_replaced();
      }
    }
  }
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

# deduplication hashtable

deduplication hashtable需要记录所有请求去重的字符串的byte array数据，并且在没有引用之后清理。 hashtable和Java里的HashMap类似，也是Bucket数组组成，每个Bucket是一个逻辑上的链表，实际上是用动态长度的数组来代替链表节省空间（next指针等）

class StringDedup::Table : AllStatic {
private:
  class Bucket;
  class CleanupState;
  class Resizer;
  class Cleaner;
  enum class DeadState;

  // Values in the table are weak references to jbyte[] Java objects.  The
  // String's coder isn't recorded, even though it affects how String access
  // would interpret that array.  For the purposes of deduplication we don't
  // care about that distinction; two Strings with equivalent arrays but
  // different coders can be deduplicated to share a single array.  We also
  // can't depend on the coder value being correct here, since GC requests
  // can provide the deduplication thread with access to a String that is
  // incompletely constructed; the value could be set before the coder.
  using TableValue = WeakHandle;  // 弱引用

  // Weak storage for the string data in the table.
  static OopStorage* _table_storage;
  static Bucket* _buckets;
  static size_t _number_of_buckets;
  static size_t _number_of_entries;
  static size_t _grow_threshold;

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24

class StringDedup::Table::Bucket {
  GrowableArrayCHeap&lt;uint, mtStringDedup> _hashes;
  GrowableArrayCHeap&lt;TableValue, mtStringDedup> _values;

1
2
3

在hashtable中元素超过threshold时，也会像HashMap一样扩容，数量变少时（清理没有引用的数据），也会进行缩容（shrink)

# string deduplication log分析

对于JDK17，通过增加-XX:stringdedup*=debug 对于JDK11，通过增加-XX:gc+stringdedup=debug 可以打开string deduplication的详细日志。

[15.950s][info ][stringdedup             ] Concurrent String Deduplication 1032/41424.0B (new), 168/5208.0B (deduped), avg 12.5%, 17.502ms of 24.439ms

1032/41424.0B (new)表示向hashtable中新增了1032个byte数组，总体大小41424.0B

168/5208.0B (deduped)表示本次对168个字符串的byte数组进行了去重，总体大小5208.0B

avg 12.5%表示从应用启动到现在，去重的字符串的占比。

17.502ms of 24.439ms 前面的17.502ms是_process_elapsed表示用于deduplication的时间，不包含
后面的24.439ms表示的是concurrent deduplication整体处理的时间，除了进行deduplication外还有cleanup_table（清理没有强引用的weakreference引用的byte array)


[15.950s][debug][stringdedup             ]   Last Process: 1/17.502ms, Idle: 1/2604.480ms, Blocked: 0/0.000ms
[15.950s][debug][stringdedup             ]   Last Cleanup Table: 1/6.764ms
[15.950s][debug][stringdedup             ]     Inspected:            1466
[15.950s][debug][stringdedup             ]       Known:               434( 29.6%)
[15.950s][debug][stringdedup             ]       Shared:                0(  0.0%)
[15.950s][debug][stringdedup             ]       New:                1032( 70.4%) 41424.0B
[15.950s][debug][stringdedup             ]       Replaced:             14(  1.4%)
[15.950s][debug][stringdedup             ]       Deleted:            2218(214.9%)
[15.950s][debug][stringdedup             ]     Deduplicated:          168( 11.5%)  5208.0B( 12.6%)
[15.950s][debug][stringdedup             ]     Skipped: 4 (dead), 0 (incomplete), 0 (shared)
[15.950s][debug][stringdedup             ]   Total Process: 8/139.331ms, Idle: 8/14716.313ms, Blocked: 0/0.000ms

Last这部分表示最近一次string deduplication。

Inspected: 1466表示这次处理了1466个字符串去重请求
Known: 434( 29.6%) 表示其中434个字符串中hashtable中有重复的byte array，29.6%表示占本次处理的字符串的比例是29.6%
Shared: 上共享内存的统计，一般用的比较少。
New: 1032( 70.4%) 41424.0B 表示hashtable中新增加了1032个byte array，总内存大小41424.0B
Replaced: 14(1.4%) 表示替换hashtable的数量，因为有的字符串不能替换value，则hashtable会替换value成新的value
Deleted: 2218(214.9%)表示删除了2218个byte array，是删除的没有引用的数据。
Deduplicated: 168( 11.5%)  5208.0B( 12.6%) Deduplicated表示给168字符串替换了value。Known和deduplicated的区别是Known中的有可能是相同的array数组对象，不用替换。

[15.950s][debug][stringdedup             ]   Total Cleanup Table: 1/6.764ms
[15.950s][debug][stringdedup             ]     Inspected:            9825
[15.950s][debug][stringdedup             ]       Known:              2826( 28.8%)
[15.950s][debug][stringdedup             ]       Shared:                0(  0.0%)
[15.951s][debug][stringdedup             ]       New:                6999( 71.2%)   298.5K
[15.951s][debug][stringdedup             ]       Replaced:             19(  0.3%)
[15.951s][debug][stringdedup             ]       Deleted:            2218( 31.7%)
[15.951s][debug][stringdedup             ]     Deduplicated:          976(  9.9%) 38088.0B( 12.5%)
[15.951s][debug][stringdedup             ]     Skipped: 20 (dead), 0 (incomplete), 0 (shared)
[15.951s][debug][stringdedup             ] Table: 4781 values in 503 buckets, 0 dead (2)

Total表示总的deduplication统计，和Last的统计含义类似，不再赘述。
Table: 4781 values in 503 buckets表示一共4781个value在503个bucket中。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48

# 配置建议

如果你的业务场景中，会有比较多的重复的字符串对象常驻在内存中，比如有比较多的本地缓存，缓存中存储的数据有比较多的重复字符串，建议开启。通过一定的CPU开销换取更低的内存占用。

在实际使用中，我们发现开启UseStringDeduplication能够降低old gen的大小（有比较多本地缓存这类数据的情况下），但是 UseStringDeduplication会带来一定程度的gc暂停时间变长。

如果觉得去重占用过多CPU，可以关闭（默认关闭）或调大-XX:StringDeduplicationAgeThreshold=3

# 参考

JEP 192: String Deduplication in G1 (opens new window)

← jstack实现原理分析 TLAB →