Download Free Install Free

Pandas Merge, Join, and Concat: How To and Examples

Table of Contents

Python’s Pandas library is a popular library for cleaning, manipulating, and interpreting large amounts of data. In an earlier post, we looked at the foundational structure in Pandas—the DataFrame. If you’re unfamiliar with Pandas DataFrames, take a look at that post to understand the basics.

In this post, we’ll learn how to combine multiple DataFrames using Pandas merge, join, and concat. These methods let you supercharge your data by gluing together data from different sources.

We’ll cover:

  • When to use concat and when to use merge or join;
  • How to use Pandas’ concat method;
  • How to use Pandas’ merge and join.

Let’s get started.

When to use the Pandas concat vs. merge and join

While merge, join, and concat all work to combine multiple DataFrames, they are used for very different things. In this section, we’ll learn when you will want to use one operation over another. The key distinction is whether you want to combine your DataFrames horizontally or vertically.

The concat method allows you to combine DataFrames vertically.

Imagine you had two DataFrames with the same columns. Perhaps the first DataFrame includes 10 rows of stock trading data for one stock while the second DataFrame includes 10 rows of stock trading data for a different stock. A vertical combination would use a DataFrame’s concat method to combine the two DataFrames into a single DataFrame with twenty rows.

Notice that in a vertical combination with concat, the number of rows has increased but the number of columns has stayed the same.

By contrast, the merge and join methods help to combine DataFrames horizontally.

Imagine you have two DataFrames. The first contains stock trading information various companies. The second contains information about the headquarters and numbers of employees for a particular company. If the two DataFrames have one field in common—such as a stock symbol or company name—you can combine the two DataFrames so that each row contains both the stock trading data and the company background information.

Notice that in this horizontal combination, we aren’t adding any additional rows. Rather, we’re adding columns to existing rows. The horizontal combination from a merge operation is similar to a JOIN operator in SQL.

Now that we understand the difference between vertical combinations with concat and horizontal combinations with merge or join, let’s take a deeper look at how to use these methods.

How to use Pandas’ concat method

In our previous post on Pandas DataFrames, we used sample stock data to show create, explore, and manipulate DataFrames. We used sample stock trading data from Amazon (“AMZN”), Apple (“AAPL”) and Google (“GOOG”).

Imagine we had a second data set with trading information on two additional companies, Facebook (“FB”) and Tesla (“TSLA”). These DataFrames have the same shape, so it would be useful to combine them to operate on them together.

We can do this with the concat method as follows:

>>> import pandas as pd
>>> stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
>>> stocks.info()
< class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
date 15 non-null object
symbol 15 non-null object
open 15 non-null float64
high 15 non-null float64
low 15 non-null float64
close 15 non-null float64
volume 15 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 920.0+ bytes
>>> stocks2 = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/53ebac036b404875ef8e981c0cbd0901/raw/6c70336130eb7e45cec167ee7cd52d15baa392ea/stocks2.csv')
>>> combined = pd.concat([stocks, stock2], ignore_index=True)
>>> combined
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.2600 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.4300 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.8000 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.7500 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.7500 1620.51 1625.95 4957017
5 2019-03-01 AAPL 174.28 175.1500 172.89 174.97 25886167
6 2019-03-04 AAPL 175.69 177.7500 173.97 175.85 27436203
7 2019-03-05 AAPL 175.94 176.0000 174.54 175.53 19737419
8 2019-03-06 AAPL 174.67 175.4900 173.94 174.52 20810384
9 2019-03-07 AAPL 173.87 174.4400 172.02 172.50 24796374
10 2019-03-01 GOOG 1124.90 1142.9700 1124.75 1140.99 1450316
11 2019-03-04 GOOG 1146.99 1158.2800 1130.69 1147.80 1446047
12 2019-03-05 GOOG 1150.06 1169.6100 1146.19 1162.03 1443174
13 2019-03-06 GOOG 1162.49 1167.5700 1155.49 1157.86 1099289
14 2019-03-07 GOOG 1155.72 1156.7600 1134.91 1143.30 1166559
15 2019-03-01 FB 162.60 163.1320 161.69 162.28 11097770
16 2019-03-04 FB 163.90 167.5000 163.83 167.37 18894689
17 2019-03-05 FB 167.37 171.8800 166.55 171.26 28187890
18 2019-03-06 FB 172.90 173.5700 171.27 172.51 21531723
19 2019-03-07 FB 171.50 171.7400 167.61 169.13 18306504
20 2019-03-01 TSLA 306.94 307.1300 291.90 294.79 22911375
21 2019-03-04 TSLA 298.12 299.0000 282.78 285.36 17096818
22 2019-03-05 TSLA 282.00 284.0000 270.10 276.54 18764740
23 2019-03-06 TSLA 276.48 281.5058 274.39 276.24 10335485
24 2019-03-07 TSLA 278.84 284.7000 274.25 276.59 9442483

In the example above, we create our first DataFrame and use the DataFrame’s info method to see that there are 15 rows and seven columns in our DataFrame. Then, we create our second DataFrame and combine it with our first DataFrame using the concat method. After this, we can see our DataFrame includes information on all five companies in across our two DataFrames.

If we use the info method to look at our new DataFrame, we can see that there are still seven columns but now there are 25 rows—the combination of the two DataFrames.

>>> combined.info()
< class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 9
Data columns (total 7 columns):
date 25 non-null object
symbol 25 non-null object
open 25 non-null float64
high 25 non-null float64
low 25 non-null float64
close 25 non-null float64
volume 25 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.6+ KB

The concat method is a great way to combine multiple DataFrames that contain similar data shapes. Note that the columns don’t need to match entirely—Pandas will simply include a null value for columns without values.

Using Pandas’ merge and join to combine DataFrames

The merge and join methods are a pair of methods to horizontally combine DataFrames with Pandas. This is a great way to enrich with DataFrame with the data from another DataFrame.

Both merge and join are operating in similar ways, but the join method is a convenience method to make it easier to combine DataFrames. In fact, join is using merge under the hood. I prefer to use join where possible as it’s slightly easier syntax.

When using either merge or join, you’ll need to specify how the DataFrames should be merged or joined. There are four possible values for how to join two DataFrames:

  • Left: Retain all rows for the first DataFrame and enrich with columns from the second DataFrame where they match on the columns on which to join;
  • Right: Same as left but reversed—retain all rows for the second DataFrame and enrich with columns from matches in the first DataFrame.
  • Inner: Retain only the intersection of the two DataFrames—rows in which there are values in both DataFrames for the columns on which the join is performed.
  • Outer: Retain all rows from both DataFrames regardless of whether there are matching rows in the other DataFrame.

These four join types match up with the four types of SQL joins. Note that the default value for how in the merge method is inner, while the default value for how in the join method is left. In general, it’s best to explicitly specify the type of join you want for clarity.

To understand how the different types of joins work, let’s walk through some examples.

Using a left join with Pandas join method

First, let’s see how to use the left join type. We will continue to use our combined stock data from the concat section above. Imagine we want to enrich that data with background information about the company, such as how many employees they have as well as where the company’s headquarters is located.

Let’s load our company background information, then join it with our trading data using a left join.

>>> companies = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/90954a10bbdebd89185f7b4d340563cd/raw/d91744e599356b7a52b9c027751d4613c06c5f40/companies.csv')
>>> companies
symbol employees headquarters_city headquarters_state
0 AMZN 613300 Seattle WA
1 GOOG 98771 Mountain View CA
2 AAPL 132000 Cupertino CA
>>> enriched = combined.join(companies.set_index('symbol'), on='symbol', how='left')
>>> enriched
date symbol open high low close volume employees headquarters_city headquarters_state
0 2019-03-01 AMZN 1655.13 1674.2600 1651.00 1671.73 4974877 613300.0 Seattle WA
1 2019-03-04 AMZN 1685.00 1709.4300 1674.36 1696.17 6167358 613300.0 Seattle WA
2 2019-03-05 AMZN 1702.95 1707.8000 1689.01 1692.43 3681522 613300.0 Seattle WA
3 2019-03-06 AMZN 1695.97 1697.7500 1668.28 1668.95 3996001 613300.0 Seattle WA
4 2019-03-07 AMZN 1667.37 1669.7500 1620.51 1625.95 4957017 613300.0 Seattle WA
5 2019-03-01 AAPL 174.28 175.1500 172.89 174.97 25886167 132000.0 Cupertino CA
6 2019-03-04 AAPL 175.69 177.7500 173.97 175.85 27436203 132000.0 Cupertino CA
7 2019-03-05 AAPL 175.94 176.0000 174.54 175.53 19737419 132000.0 Cupertino CA
8 2019-03-06 AAPL 174.67 175.4900 173.94 174.52 20810384 132000.0 Cupertino CA
9 2019-03-07 AAPL 173.87 174.4400 172.02 172.50 24796374 132000.0 Cupertino CA
10 2019-03-01 GOOG 1124.90 1142.9700 1124.75 1140.99 1450316 98771.0 Mountain View CA
11 2019-03-04 GOOG 1146.99 1158.2800 1130.69 1147.80 1446047 98771.0 Mountain View CA
12 2019-03-05 GOOG 1150.06 1169.6100 1146.19 1162.03 1443174 98771.0 Mountain View CA
13 2019-03-06 GOOG 1162.49 1167.5700 1155.49 1157.86 1099289 98771.0 Mountain View CA
14 2019-03-07 GOOG 1155.72 1156.7600 1134.91 1143.30 1166559 98771.0 Mountain View CA
15 2019-03-01 FB 162.60 163.1320 161.69 162.28 11097770 NaN NaN NaN
16 2019-03-04 FB 163.90 167.5000 163.83 167.37 18894689 NaN NaN NaN
17 2019-03-05 FB 167.37 171.8800 166.55 171.26 28187890 NaN NaN NaN
18 2019-03-06 FB 172.90 173.5700 171.27 172.51 21531723 NaN NaN NaN
19 2019-03-07 FB 171.50 171.7400 167.61 169.13 18306504 NaN NaN NaN
20 2019-03-01 TSLA 306.94 307.1300 291.90 294.79 22911375 NaN NaN NaN
21 2019-03-04 TSLA 298.12 299.0000 282.78 285.36 17096818 NaN NaN NaN
22 2019-03-05 TSLA 282.00 284.0000 270.10 276.54 18764740 NaN NaN NaN
23 2019-03-06 TSLA 276.48 281.5058 274.39 276.24 10335485 NaN NaN NaN
24 2019-03-07 TSLA 278.84 284.7000 274.25 276.59 9442483 NaN NaN NaN

Let’s walk through what’s happening here.

First, we load our companies DataFrame from a URL. When we look at that DataFrame, we can see it has data for three of our companies—AMZN, AAPL, and GOOG—but not for two others—FB and TSLA.

We then use the join() method to combine our two DataFrames. We do this using a left join and specify that the join should occur on the symbol column, which is common to both DataFrames.

After our join is complete, we print out the DataFrame. You can see that there are the 25 rows from the original DataFrame and that some of those rows have been enriched with information about company headquarters and number of employees. However, it retained those rows that didn’t have matching information in the second DataFrame.

Using an inner join with Pandas join method

Now let’s see how the same two DataFrames would be joined when using an inner join.

>>> enriched_inner = combined.join(companies.set_index('symbol'), on='symbol', how='inner')
>>> enriched_inner
date symbol open high low close volume employees headquarters_city headquarters_state
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877 613300 Seattle WA
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358 613300 Seattle WA
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522 613300 Seattle WA
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001 613300 Seattle WA
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017 613300 Seattle WA
5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167 132000 Cupertino CA
6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203 132000 Cupertino CA
7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419 132000 Cupertino CA
8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384 132000 Cupertino CA
9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374 132000 Cupertino CA
10 2019-03-01 GOOG 1124.90 1142.97 1124.75 1140.99 1450316 98771 Mountain View CA
11 2019-03-04 GOOG 1146.99 1158.28 1130.69 1147.80 1446047 98771 Mountain View CA
12 2019-03-05 GOOG 1150.06 1169.61 1146.19 1162.03 1443174 98771 Mountain View CA
13 2019-03-06 GOOG 1162.49 1167.57 1155.49 1157.86 1099289 98771 Mountain View CA
14 2019-03-07 GOOG 1155.72 1156.76 1134.91 1143.30 1166559 98771 Mountain View CA

Our join code is the exact same as the previous section except that we changed from a left to an inner join. When we print out the DataFrame, you can see the difference—we only have fifteen lines in our combined DataFrame. The 10 lines of FB and TSLA that didn’t have a match in our companies DataFrame were dropped entirely rather than including null values for the missing columns.

The type of join you want to use will depend on your use case. In these two examples, we learned how the type of join affects which rows are retained in a join operation.

Conclusion

In this post, we learned how to combine DataFrames in Pandas. We learned that there are two types of ways to combine DataFrames—horizontally or vertically. We then walked through examples of combining vertically with concat() and horizontally with join() or merge(). Finally, we learned about the different types of joins with using join() or merge().

Remembering the different ways to combine DataFrames or to perform joins can be tricky, but Kite has you covered. Kite is an autocomplete for Python to speed up your workflow while coding. Further, it helps you look up Python docs without going to Google. Kite even provides examples of how other people have used similar libraries and methods, allowing you to take advantage of the expertise of others. Check it out today!