Java 8 – Find duplicate elements in a Stream

Wesley13
• 阅读 816

https://mkyong.com/java8/java-8-find-duplicate-elements-in-a-stream/

This article shows you three algorithms to find duplicate elements in a Stream.

  • Set.add()
  • Collectors.groupingBy
  • Collections.frequency

At the end of the article, we use the JMH benchmark to test which one is the fastest algorithm.

1. Filter & Set.add()

The Set.add() returns false if the element was already in the set; let see the benchmark at the end of the article.

JavaDuplicated1.java

package com.mkyong;

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class JavaDuplicated1 {

    public static void main(String[] args) {

        // 3, 4, 9
        List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);

        Set<Integer> result = findDuplicateBySetAdd(list);

        result.forEach(System.out::println);

    }

    public static <T> Set<T> findDuplicateBySetAdd(List<T> list) {

        Set<T> items = new HashSet<>();
        return list.stream()
                .filter(n -> !items.add(n)) // Set.add() returns false if the element was already in the set.
                .collect(Collectors.toSet());

    }

}

Output

3
4
9

2. Map & Collectors.groupingBy

2.1 Create a Map by Collectors.groupingBy and find elements that count > 1.

JavaDuplicated2.java

package com.mkyong;

import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public class JavaDuplicated2 {

    public static void main(String[] args) {

        // 3, 4, 9
        List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);

        Set<Integer> result = findDuplicateByGrouping(list);

        result.forEach(System.out::println);

    }

    public static <T> Set<T> findDuplicateByGrouping(List<T> list) {

        return list.stream()
                .collect(Collectors.groupingBy(Function.identity()
                        , Collectors.counting()))    // create a map {1=1, 2=1, 3=2, 4=2, 5=1, 7=1, 9=2}
                .entrySet().stream()                 // Map -> Stream
                .filter(m -> m.getValue() > 1)       // if map value > 1, duplicate element
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

    }

}

Output

3
4
9

3. Collections.frequency

It compares each item with a list – Collections.frequency(list, i).

JavaDuplicated3.java

package com.mkyong;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class JavaDuplicated3 {

    public static void main(String[] args) {

        // 3, 4, 9
        List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2, 9, 9, 4);

        Set<Integer> result = findDuplicateByFrequency(list);

        result.forEach(System.out::println);

    }

    public static <T> Set<T> findDuplicateByFrequency(List<T> list) {

        return list.stream().filter(i -> Collections.frequency(list, i) > 1)
                .collect(Collectors.toSet());

    }

}

Output

3
4
9

4. JMH Benchmark

A simple JMH benchmark for the above three algorithms, find duplicated elements from a Stream, which has a size of 1000.

pom.xml

  <dependency>
      <groupId>org.openjdk.jmh</groupId>
      <artifactId>jmh-core</artifactId>
      <version>1.23</version>
  </dependency>

  <dependency>
      <groupId>org.openjdk.jmh</groupId>
      <artifactId>jmh-generator-annprocess</artifactId>
      <version>1.23</version>
  </dependency>

BenchmarkFindDuplicate.java

package com.mkyong;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.*;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.stream.Collectors;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Fork(value = 2, jvmArgs = {"-Xms4G", "-Xmx4G"})
public class BenchmarkFindDuplicate {

    private List<Integer> DATA_FOR_TESTING;

    @Setup
    public void init() {
        // random 1000 size
        DATA_FOR_TESTING = new Random().ints(1000, 1, 1000)
                .boxed()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) throws RunnerException {

        Options opt = new OptionsBuilder()
                .include(BenchmarkFindDuplicate.class.getSimpleName())
                .forks(1)
                .build();

        new Runner(opt).run();

    }

    @Benchmark
    public void setAdd(Blackhole bh) {

        Set<Integer> items = new HashSet<>();
        Set<Integer> collect = DATA_FOR_TESTING.stream()
                .filter(n -> !items.add(n))
                .collect(Collectors.toSet());

        bh.consume(collect);

    }

    @Benchmark
    public void groupingBy(Blackhole bh) {

        Set<Integer> collect = DATA_FOR_TESTING.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()))
                .entrySet()
                .stream()
                .filter(m -> m.getValue() > 1)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());

        bh.consume(collect);

    }

    @Benchmark
    public void frequency(Blackhole bh) {

        Set<Integer> collect = DATA_FOR_TESTING.stream()
                .filter(i -> Collections.frequency(DATA_FOR_TESTING, i) > 1)
                .collect(Collectors.toSet());

        bh.consume(collect);

    }

}

Output

# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.frequency

# Run progress: 0.00% complete, ETA 00:05:00
# Fork: 1 of 1
# Warmup Iteration   1: 0.827 ms/op
# Warmup Iteration   2: 0.821 ms/op
# Warmup Iteration   3: 0.812 ms/op
# Warmup Iteration   4: 0.822 ms/op
# Warmup Iteration   5: 0.822 ms/op
Iteration   1: 0.814 ms/op
Iteration   2: 0.810 ms/op
Iteration   3: 0.779 ms/op
Iteration   4: 0.776 ms/op
Iteration   5: 0.814 ms/op


Result "com.mkyong.BenchmarkFindDuplicate.frequency":
  0.799 ±(99.9%) 0.075 ms/op [Average]
  (min, avg, max) = (0.776, 0.799, 0.814), stdev = 0.019
  CI (99.9%): [0.724, 0.874] (assumes normal distribution)


# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.groupingBy

# Run progress: 33.33% complete, ETA 00:03:20
# Fork: 1 of 1
# Warmup Iteration   1: 0.040 ms/op
# Warmup Iteration   2: 0.038 ms/op
# Warmup Iteration   3: 0.037 ms/op
# Warmup Iteration   4: 0.036 ms/op
# Warmup Iteration   5: 0.039 ms/op
Iteration   1: 0.039 ms/op
Iteration   2: 0.039 ms/op
Iteration   3: 0.039 ms/op
Iteration   4: 0.039 ms/op
Iteration   5: 0.039 ms/op


Result "com.mkyong.BenchmarkFindDuplicate.groupingBy":
  0.039 ±(99.9%) 0.001 ms/op [Average]
  (min, avg, max) = (0.039, 0.039, 0.039), stdev = 0.001
  CI (99.9%): [0.038, 0.040] (assumes normal distribution)


# JMH version: 1.23
# VM version: JDK 11.0.6, OpenJDK 64-Bit Server VM, 11.0.6+10
# VM invoker: /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64/bin/java
# VM options: -Xms4G -Xmx4G
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: com.mkyong.BenchmarkFindDuplicate.setAdd

# Run progress: 66.67% complete, ETA 00:01:40
# Fork: 1 of 1
# Warmup Iteration   1: 0.027 ms/op
# Warmup Iteration   2: 0.028 ms/op
# Warmup Iteration   3: 0.026 ms/op
# Warmup Iteration   4: 0.026 ms/op
# Warmup Iteration   5: 0.027 ms/op
Iteration   1: 0.026 ms/op
Iteration   2: 0.027 ms/op
Iteration   3: 0.028 ms/op
Iteration   4: 0.028 ms/op
Iteration   5: 0.028 ms/op


Result "com.mkyong.BenchmarkFindDuplicate.setAdd":
  0.027 ±(99.9%) 0.003 ms/op [Average]
  (min, avg, max) = (0.026, 0.027, 0.028), stdev = 0.001
  CI (99.9%): [0.024, 0.031] (assumes normal distribution)


# Run complete. Total time: 00:05:01

REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.

Benchmark                          Mode  Cnt  Score   Error  Units
BenchmarkFindDuplicate.frequency   avgt    5  0.799 ± 0.075  ms/op
BenchmarkFindDuplicate.groupingBy  avgt    5  0.039 ± 0.001  ms/op
BenchmarkFindDuplicate.setAdd      avgt    5  0.027 ± 0.003  ms/op

Process finished with exit code 0

In Java 8 Stream, filter with Set.Add() is the fastest algorithm to find duplicate elements, because it loops only one time.

  Set<T> items = new HashSet<>();
  return list.stream()
        .filter(n -> !items.add(n))
        .collect(Collectors.toSet());

The Collections.frequency is the slowest because it compares each item with a list – Collections.frequency(list, i). If we increase the size of the list, the performance will get slower.

return list.stream().filter(i -> Collections.frequency(list, i) > 1)
        .collect(Collectors.toSet());

References

  • Java JMH Benchmark Tutorial
  • How to count duplicated items in Java List
  • Stream JavaDoc

benchmark collectors duplicated group by java 8 jmh stream

点赞
收藏
评论区
推荐文章
blmius blmius
3年前
MySQL:[Err] 1292 - Incorrect datetime value: ‘0000-00-00 00:00:00‘ for column ‘CREATE_TIME‘ at row 1
文章目录问题用navicat导入数据时,报错:原因这是因为当前的MySQL不支持datetime为0的情况。解决修改sql\mode:sql\mode:SQLMode定义了MySQL应支持的SQL语法、数据校验等,这样可以更容易地在不同的环境中使用MySQL。全局s
待兔 待兔
5个月前
手写Java HashMap源码
HashMap的使用教程HashMap的使用教程HashMap的使用教程HashMap的使用教程HashMap的使用教程22
Jacquelyn38 Jacquelyn38
3年前
2020年前端实用代码段,为你的工作保驾护航
有空的时候,自己总结了几个代码段,在开发中也经常使用,谢谢。1、使用解构获取json数据let jsonData  id: 1,status: "OK",data: 'a', 'b';let  id, status, data: number   jsonData;console.log(id, status, number )
Wesley13 Wesley13
3年前
Java日期时间API系列31
  时间戳是指格林威治时间1970年01月01日00时00分00秒起至现在的总毫秒数,是所有时间的基础,其他时间可以通过时间戳转换得到。Java中本来已经有相关获取时间戳的方法,Java8后增加新的类Instant等专用于处理时间戳问题。 1获取时间戳的方法和性能对比1.1获取时间戳方法Java8以前
Stella981 Stella981
3年前
KVM调整cpu和内存
一.修改kvm虚拟机的配置1、virsheditcentos7找到“memory”和“vcpu”标签,将<namecentos7</name<uuid2220a6d1a36a4fbb8523e078b3dfe795</uuid
Easter79 Easter79
3年前
Twitter的分布式自增ID算法snowflake (Java版)
概述分布式系统中,有一些需要使用全局唯一ID的场景,这种时候为了防止ID冲突可以使用36位的UUID,但是UUID有一些缺点,首先他相对比较长,另外UUID一般是无序的。有些时候我们希望能使用一种简单一些的ID,并且希望ID能够按照时间有序生成。而twitter的snowflake解决了这种需求,最初Twitter把存储系统从MySQL迁移
Wesley13 Wesley13
3年前
mysql设置时区
mysql设置时区mysql\_query("SETtime\_zone'8:00'")ordie('时区设置失败,请联系管理员!');中国在东8区所以加8方法二:selectcount(user\_id)asdevice,CONVERT\_TZ(FROM\_UNIXTIME(reg\_time),'08:00','0
Wesley13 Wesley13
3年前
00:Java简单了解
浅谈Java之概述Java是SUN(StanfordUniversityNetwork),斯坦福大学网络公司)1995年推出的一门高级编程语言。Java是一种面向Internet的编程语言。随着Java技术在web方面的不断成熟,已经成为Web应用程序的首选开发语言。Java是简单易学,完全面向对象,安全可靠,与平台无关的编程语言。
Wesley13 Wesley13
3年前
MySQL部分从库上面因为大量的临时表tmp_table造成慢查询
背景描述Time:20190124T00:08:14.70572408:00User@Host:@Id:Schema:sentrymetaLast_errno:0Killed:0Query_time:0.315758Lock_
Python进阶者 Python进阶者
10个月前
Excel中这日期老是出来00:00:00,怎么用Pandas把这个去除
大家好,我是皮皮。一、前言前几天在Python白银交流群【上海新年人】问了一个Pandas数据筛选的问题。问题如下:这日期老是出来00:00:00,怎么把这个去除。二、实现过程后来【论草莓如何成为冻干莓】给了一个思路和代码如下:pd.toexcel之前把这