Value_counts and Groupby in Pandas Explained in Easy Steps
Analysts use value_counts and groupby in Pandas to explore a dataset and summarize information fast. This tutorial explains value_counts and groupby in Pandas with simple examples that beginners understand. When you learn value_counts and groupby in Pandas, you get better at summarizing data quickly. Most data analysis tasks depend on value_counts and groupby in Pandas because they help you convert raw data into insights.
The Dataset (bmw.csv)
Download Sample Dataset: bmw.csv
1. Load the Dataset
You start by importing Pandas and loading the BMW CSV file. The read_csv() function reads the file and creates a DataFrame. Calling head() shows the first five rows, which lets you confirm whether the data loaded correctly.
Python:
import pandas as pd
df = pd.read_csv('bmw.csv')
df.head()Output:
model year price transmission mileage fuelType tax mpg engineSize 2 Series 2018 16250 Manual 10401 Petrol 145 52.3 1.5 5 Series 2014 11200 Automatic 67068 Diesel 125 57.6 2.0 6 Series 2018 27000 Automatic 14827 Petrol 145 42.8 2.0 5 Series 2016 16000 Automatic 62794 Diesel 160 51.4 3.0 1 Series 2017 12750 Automatic 26676 Diesel 145 72.4 1.5
2. Understanding value_counts()
The value_counts() function counts how many times each unique value appears in a column. It sorts the results in descending order, returns a Series, and uses the unique values as the index. This makes it useful when you want to understand category frequency.
However, many beginners skip value_counts even though it saves time.
2.1 Count Transmission Types
Python:
df.value_counts() When this runs, Pandas checks the transmission column, groups identical values together, counts their occurrences, and returns the counts.
Output:
transmission count Semi-Auto 4666 Automatic 3588 Manual 2527
This shows that semi-automatic transmissions appear the most in the dataset.
In addition, both functions work well when you analyse multiple columns.
2.2 Get Proportions Using normalize=True
Python:
df.value_counts(normalize=True) The normalize=True argument tells Pandas to divide each count by the total number of rows. Instead of raw counts, you receive proportions. Output transmission proportion Semi-Auto 0.432798 Automatic 0.332808 Manual 0.234394
This means about 43 percent of all BMW cars in the dataset are Semi-Auto.
2.3 Convert to Percentage
Python:
df.value_counts(normalize=True) * 100Multiplying by 100 changes the proportions into readable percentages.
Output
transmission percent Semi-Auto 43.279844 Automatic 33.280772 Manual 23.439384
2.4 Disable Sorting Using sort=False
Python:
df.value_counts(sort=False)The sort=False option tells Pandas to return counts in the order values appear in the column instead of sorting them.
Output
transmission count Manual 2527 Automatic 3588 Semi-Auto 4666
2.5 Sort by Index
Python:
df.value_counts().sort_index()Here, Pandas counts the values and then sorts them alphabetically by their index. That helps when you want an organized output rather than descending order.
Output
transmission count Automatic 3588 Manual 2527 Semi-Auto 4666
2.6 Reset the Index
Python:
df.value_counts().reset_index()The reset_index() function converts the Series into a DataFrame. The old index becomes a normal column, which makes the output easier to join with other tables or export.
Output
index transmission Semi-Auto 4666 Automatic 3588 Manual 2527
3. Count Multiple Columns Using value_counts()
You can count combinations by passing multiple columns. Pandas treats each unique pair as a separate category and counts how often it appears.
Python:
df].value_counts(normalize=True)
Pandas groups by both year and transmission together, then counts how many rows fall into each pair.
Output (truncated)
year transmission count 2019 Semi-Auto 2005 2019 Automatic 997 2016 Automatic 736 2017 Semi-Auto 672
This reveals that 2019 had the highest number of semi-automatic cars.
3.1 Get Proportion for Multiple Columns
Python:
df].value_counts(normalize=True)Pandas now divides each pair count by the total number of rows and returns proportions instead of raw frequencies.
Output:
year transmission proportion 2019 Semi-Auto 0.185975 2019 Automatic 0.092478 2016 Automatic 0.068268
3.2 Percentage Format
Python:
df].value_counts(normalize=True) * 100This returns percentage values, making it easier to read and compare the categories.
Output:
year transmission percent 2019 Semi-Auto 18.59 2019 Automatic 9.24
3.3 Sort Index on Both Columns
Python:
mdf = df].value_counts(normalize=True) * 100
mdf.sort_index()sort_index() sorts based on the tuple index consisting of year and transmission. This gives neatly organized output ordered by both columns.
Output:
year transmission percent 1996 Automatic 0.009276 1997 Manual 0.009276 2020 Semi-Auto 5.509693
3.4 Reset Index and Sort
Python:
mdf = mdf.reset_index()
mdf.sort_values(, ascending=)After resetting the index, the columns become normal DataFrame columns. Sorting them gives better control over the display order, such as sorting by year ascending and transmission descending.
Output:
year transmission percent 1996 Automatic 0.009276 1997 Manual 0.009276 ... ... ... 2020 Automatic 0.704944
4. Understanding groupby()
The groupby() function groups rows based on a column and then performs calculations on each group. You can compute mean, median, sum, max, min, count, and more. Pandas separates the data into groups internally and then applies your chosen operation to each group independently.
4.1 Average Mileage per Transmission
Python:
mdf = df.groupby('transmission').agg(avg_mileage=('mileage','mean'))
mdfAll rows are grouped by transmission type in this code, which then chooses the mileage column within each group, computes the mean, and outputs the results in a new DataFrame.
Output:
transmission avg_mileage Automatic 31830.67 Manual 34720.48 Semi-Auto 15631.34
4.2 Reset Index
Python:
mdf.reset_index()Resetting the index converts the group labels back into normal DataFrame columns.
4.3 Sort by Mileage
Python:
mdf.sort_values('avg_mileage', ascending=False)Sorting the result arranges the transmission types based on average mileage. This makes comparisons easier.
Manual cars show the highest average mileage among the three.
Output:
transmission avg_mileage Manual 34720.48 Automatic 31830.67 Semi-Auto 15631.34
Insight:
Manual cars have the highest mileage.
5. Groupby With Multiple Aggregations
A dictionary-style structure within agg() allows you to compute multiple statistics simultaneously. Pandas adds the outcomes to new output columns after applying each operation to the designated column.
For example, groupby helps you split data before applying calculations.
Python:
mdf = df.groupby('transmission').agg(
avg_mileage=('mileage','mean'),
median_mileage=('mileage','median'),
car_count=('model','count')
)Here, Pandas groups by transmission type and calculates three different values. It takes the mean and median of mileage and counts the number of models in each group.
Output:
transmission avg_mileage median_mileage car_count Manual 34720.48 30000 2527 Automatic 31830.67 26193.5 3588 Semi-Auto 15631.34 9036.5 4666
6. Price and Engine Size Analysis
You can combine many aggregations across multiple columns. Pandas handles each operation one by one inside each transmission group.
Python:
mdf = df.groupby('transmission').agg(
model_count=('model','count'),
total_price=('price','sum'),
avg_price=('price','mean'),
median_price=('price','median'),
max_price=('price','max'),
min_price=('price','min'),
avg_engine_size=('engineSize','mean')
)Pandas divides the data by transmission first. The total number of cars, the average price, the median price, the highest and lowest prices, the sum of prices, and the average engine size for each transmission group are then determined.
Output:
transmission model_count total_price avg_price max_price min_price avg_engine_size Automatic 3588 80439458 22419.02 99950 1200 2.25 Manual 2527 36988839 14637.45 46000 1445 1.83 Semi-Auto 4666 127660584 27359.74 123456 5299 2.27
7. Best Fuel Type for Mileage
You can group by fuel type to examine mileage performance across different fuel categories.
Python:
mdf = df.groupby('fuelType').agg(
avg_mileage=('mileage','mean'),
median_mileage=('mileage','median')
).sort_values('avg_mileage', ascending=False)Fuel types are grouped by Pandas, which then calculates mileage statistics and sorts the results so that the fuel type with the best mileage shows up first.
Output:
fuelType avg_mileage median_mileage Other 44760.63 44351.5 Diesel 29852.80 24600 Hybrid 23400.53 16900 Electric 19925.33 20321 Petrol 16524.09 9097
8. Multi-Column Groupby
You can group by more than one column at the same time. Pandas creates a separate group for every transmission and fuel type combination.
Python:
mdf = df.groupby().agg(
avg_price=('price','mean')
)Pandas reads each unique pair of transmission and fuelType, groups rows accordingly, calculates the average price inside each group, and returns the results in a multi-index DataFrame.
Output:
transmission fuelType avg_price Automatic Diesel 21216.70 Automatic Petrol 25364.46 Manual Diesel 13722.74 Semi-Auto Petrol 28209.39 Semi-Auto Hybrid 30115.37
9. Multi Aggregation on Price and Mileage
Here, you calculate the average price, total price, and average mileage for each transmission and fuel type combination.
As a result, your summaries become cleaner and easier to compare.
Python:
mdf = df.groupby().agg(
avg_price=('price','mean'),
total_price=('price','sum'),
avg_mileage=('mileage','mean'),
)Pandas handles each calculation separately but keeps the same group structure. The result gives deep insights into how transmission and fuel type together affect price and mileage.
Finally, you understand how to combine value_counts and groupby for deeper analysis.
Output:
transmission fuelType avg_price total_price avg_mileage Automatic Diesel 21216.70 52808380 36286.72 Semi-Auto Petrol 28209.39 46601921 11222.51
Final Thought
You looked at how value_counts() facilitates working with multiple columns, formatting results, calculating proportions, and counting values. In addition to applying computations like mean, median, sum, count, and more, you learned how to group rows using the groupby() function. You also saw how multi-aggregation and multi-column grouping produce effective summaries. These features provide you with powerful tools for efficiently and rapidly analyzing complicated datasets.
Internal link:
Aggregate Functions in Pandas: Beginner’s Guide with Examples
Filtering in Pandas: Learn loc, iloc, isin(), and between() Load Data in Pandas – A Complete Beginner’s Guide to Data Import
Pandas official documentation:
https://pandas.pydata.org/docs/ag
Conclusion:
Data science keeps changing how teams solve problems. You get clearer insights, stronger decisions, and better results when you use data in your work. Keep building your skills and stay curious about new tools. Your progress depends on how you apply data in real situations.
Want to know what else can be done by Data Science?
If you wish to learn more about data science or want to advance your career in the data science field, feel free to join our free workshop on Master’s in Data Science with Power BI, where you will get to know how exactly the data science field works and why companies are ready to pay handsome salaries in this field. In this workshop, you will get to know each tool and technology from scratch, which will make you skillfully eligible for any data science profile. To join this workshop, register yourself on ConsoleFlare, and we will call you back.
Thinking, Why Console Flare?
Recently, ConsoleFlare has been recognized as one of the Top 10 Most Promising Data Science Training Institutes of 2023. Console Flare offers the opportunity to learn Data Science in Hindi, just like how you speak daily. Console Flare believes in the idea of “What to learn and what not to learn,” and this can be seen in their curriculum structure. They have designed their program based on what you need to learn for data science and nothing else.
Want more reasons?
Register yourself on ConsoleFlare, and we will call you back. Log in or sign up to view See posts, photos, and more on Facebook. value_counts and groupby in Pandas
