PySpark's DataFrame API comes with two functions for removing duplicate rows: distinct() and dropDuplicates(). distinct() eliminates duplicate records — rows matching on all columns — while count() returns the number of records in a DataFrame. PySpark also offers a direct SQL route through spark.sql(), letting developers integrate SQL queries with Spark programs and work with structured data in the familiar SQL language. Beyond deduplication, pyspark.sql.functions provides aggregate functions for tallying totals, averaging values, and counting occurrences, as well as collection functions for operating on ArrayType columns: collect_list(col) gathers a column's values into a list while keeping duplicates, and array_distinct(col) returns a new array column with duplicate elements removed. count_distinct() counts distinct values (countDistinct is an alias; the underscore spelling is the encouraged one), listagg_distinct(col, delimiter=None) concatenates distinct non-null values separated by a delimiter, and the union() and unionAll() transformations merge two or more DataFrames that share the same schema. This article explores the capabilities, syntax, and practical use of these functions, starting with selecting unique rows or unique values from a particular column.
When a distinct count is needed within window partitions, the approx_count_distinct window function is the standard alternative: it returns the estimated number of distinct values in a column within each group. Window functions calculate a result — a rank, a row number, an aggregate — over a range of input rows, and Spark does not support an exact count(distinct ...) as a window function. For ordinary aggregation, count_distinct(col, *cols) returns a new Column with the distinct count of one or more columns, and approx_count_distinct(col, rsd=None) returns an approximate distinct count, with the optional rsd argument bounding the relative standard deviation of the estimate. The deprecated approxCountDistinct simply calls approx_count_distinct and emits a warning, so the very same code runs either way. At the DataFrame level, df.select("x").distinct().show() displays the unique values of column x; to check distinct combinations of multiple columns, select those columns together before calling distinct(). DataFrame.distinct() itself returns a new DataFrame containing only the distinct rows of the original.
Suppose we have a DataFrame with a URL column and want its unique values. PySpark SQL provides a DataFrame API for manipulating data in a distributed and fault-tolerant manner, and extracting unique values from a column is a one-liner. One behavioral difference is worth noting: with distinct() you need a prior .select() to pick the columns to deduplicate on, and the returned DataFrame contains only those selected columns, whereas dropDuplicates(colNames) deduplicates using the named columns but returns all columns of the initial DataFrame. To count unique values, chaining distinct() with count() is a simple and efficient approach, and array_distinct() can likewise be applied through withColumn() to add a deduplicated copy of an array column. The distinct() function retrieves unique rows from a DataFrame, returning a new DataFrame with distinct rows based on all columns of the original.
Here are the differences and typical use cases, particularly in SQL-like environments. Window functions calculate results such as rank or row number over a range of rows; the distinct method, by contrast, removes duplicate rows from a dataset, returning a new DataFrame with only unique entries. distinct is a transformation, meaning it is lazy: Spark plans the deduplication but waits for an action like show() to execute it. Duplicates can be removed in several ways — drop_duplicates, distinct, or a groupBy — and if you are more comfortable with SQL, PySpark's SQL module lets you filter duplicates with familiar syntax: register the DataFrame as a temporary view, then use DISTINCT or ROW_NUMBER().
The agg() function allows for more customization, accepting several aggregate expressions at once, and pivot() reshapes grouped data. To count the distinct values in a column, use the countDistinct() function; to merely display the unique values, select the column(s) of interest and call distinct(). sum_distinct(col) is the aggregate counterpart that returns the sum of distinct values in an expression. Spark SQL additionally supports three types of set operators — EXCEPT (or MINUS), INTERSECT, and UNION — which combine two input relations into one; EXCEPT and EXCEPT ALL return the rows found in one relation but not the other, and the input relations must have the same number of columns with compatible data types. A typical aggregation task, such as counting the number of students per year, combines groupBy with one of these aggregate functions.
A small set of functions covers most aggregation and array needs: collect_list, collect_set, array_distinct, explode, pivot, and stack. array_distinct(col) removes duplicate values from an array column, returning a new array column with any duplicates present in the original eliminated. For large data, distinct counting in PySpark can also be fast: one vendor benchmark (Abzooba) reports distinct counts running around 2–3x faster than pandas or single-node SQL thanks to the optimized distributed engine, and approx_count_distinct extends distinct counting to window partitions, where the exact version is unsupported. Also note the subset behavior of dropDuplicates: dropDuplicates(["department", "salary"]) will only consider the listed columns when deciding which rows are duplicates. Some users additionally report performance improvements from replacing distinct() on a DataFrame with an equivalent groupBy().
First, we'll create a DataFrame to use throughout the examples; the usual imports are SparkSession from pyspark.sql and the functions module from pyspark.sql.functions (commonly aliased as F). distinct() only lists the unique values — to also see how many times each distinct value occurs, group by the column and count. Databricks SQL and Databricks Runtime expose the same array_distinct function in plain SQL. One caveat: running countDistinct directly over a window raises AnalysisException: Distinct window functions are not supported — the approximate function is the usual workaround. You can also count distinct values from a registered view with a SQL query, and countDistinct() returns the count of unique values in the specified column(s).
Calling show(100) after distinct() displays up to 100 distinct values of the selected column. PySpark ships with many built-in standard functions in pyspark.sql.functions, so most processing needs are covered without custom code. One subtlety with multi-column counts: df.agg(F.countDistinct("a", "b", "c")).show() counts distinct tuples, and its treatment of nulls can be unintuitive — rows with a null in any of the listed columns are excluded from the count, matching SQL's COUNT(DISTINCT ...) semantics. Used well, these aggregate functions transform DataFrames into concise metrics, whether you are tallying totals, averaging values, or counting occurrences. A common workflow is to fetch the distinct values of a column and then perform some specific transformation on each of them; recent Spark releases also support these APIs over Spark Connect.
Under the hood, approx_count_distinct is based on the HyperLogLog++ algorithm, which is why it scales to columns with more than 50 million records where an exact count would be expensive. Distinct values can be counted in three ways: chaining distinct() with count(), using agg() with a distinct-count expression, or pivoting. The collect_list() and collect_set() functions create an ArrayType column by merging rows, typically after a groupBy or within window partitions; collect_list keeps duplicates while collect_set removes them. When many aggregations are needed at once, a common pattern is to generate a dictionary of aggregate expressions and pass it to agg().
To query a DataFrame directly with SQL: set up a SparkSession, read your file into a DataFrame, register it as a temporary view, query it using SQL syntax, and save the results as objects or output them to files. Chaining count_distinct with groupBy() gives the distinct count per group. Keep in mind that union resolves columns by position, not by name, following standard SQL behavior. DISTINCT and COLLECT_SET play different roles in analysis: DISTINCT deduplicates rows, while COLLECT_SET gathers the unique values of a column into an array. In short, distinct() considers all columns and removes any rows that are identical across every column; understanding how it differs from dropDuplicates() lets you choose the right method for your use case.
Both distinct() and dropDuplicates() return a new DataFrame — distinct is a lazy transformation, executed only when an action runs. The union() method performs a SQL-style set union of the rows from both DataFrame objects, with no automatic deduplication of elements, so chain distinct() afterwards when set semantics are needed. distinct() performs on all columns, whereas dropDuplicates() can be restricted to selected columns. These operations are central to data cleansing, ensuring that each row in the resulting DataFrame is unique. To see the distinct values of one specific column, select that column and call distinct() — df.select('column').distinct() — or group by it when frequencies are needed; and apply distinct() before sampling or limiting if the sample itself must be duplicate-free.
So what is the most efficient way to extract distinct values from a column? For a single column, select it and call distinct(), with no SQL query or groupBy required. For column-level statistics, note that describe() reports only the count, not the distinct count, so counting the distinct values in every column requires an explicit aggregation. The basic count function returns the number of records, optionally for a specified column, while count_distinct and sum_distinct restrict their aggregation to distinct values.
Finally, a common question: does DISTINCT have a negative impact on big data workloads compared with GROUP BY? For Apache Spark SQL the answer is generally no — the optimizer rewrites a distinct as an equivalent aggregate, so SELECT DISTINCT and the corresponding GROUP BY typically yield the same physical plan. To keep only distinct rows when combining DataFrames, perform the union and then call distinct(). And when multiple columns are involved, countDistinct() remains the SQL function of choice for getting the distinct count of the selected columns.