Fix pandas merge Overlapping Column Suffixes
After running pd.merge, columns that exist in both DataFrames under the same name — but are not listed in on= — appear as col_x and col_y in the output. If the suffix strings themselves collide you also get MergeError: columns overlap but no suffix specified. Neither outcome is what you want.
This page covers the root cause, how to diagnose which columns will collide before merging, and four concrete fixes: setting meaningful suffixes, dropping before merging, coalescing with combine_first, and using left_on/right_on for differently-named keys. For the broader merge workflow across many files, see Merging Multiple Spreadsheets.
Root Cause
pandas appends _x and _y to every non-key column that shares a name between the left and right DataFrames. Only the column(s) named in on= (or left_on=/right_on=) are treated as keys and deduplicated. Everything else gets suffixed to avoid silent data loss.
# pip install pandas
import pandas as pd
left = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund" ], "notes": ["ok", "check"]})
result = pd.merge(left, right, on="order_id")
print(result.columns.tolist())
# ['order_id', 'status_x', 'amount', 'status_y', 'notes']
status appears in both frames. Because it is not listed in on=, pandas duplicates it with _x / _y instead of silently discarding one side.
The suffix logic is intentional: pandas has no way to know which version of status is correct, so it keeps both and lets you decide. The problem is that _x/_y are meaningless names. If both frames also share a second non-key column (say updated_at), you get updated_at_x and updated_at_y as well. The output quickly becomes unreadable.
The default suffixes are controlled by the suffixes parameter, which defaults to ("_x", "_y"). Every fix below either changes those suffixes to something meaningful or eliminates the collision before the merge runs.
Minimal Diagnostic
Run this before merging to see exactly which columns will collide:
# pip install pandas
import pandas as pd
def find_overlapping_cols(left: pd.DataFrame, right: pd.DataFrame, key: str | list) -> list[str]:
"""Return non-key columns present in both frames."""
keys = {key} if isinstance(key, str) else set(key)
left_cols = set(left.columns) - keys
right_cols = set(right.columns) - keys
return sorted(left_cols & right_cols)
print(find_overlapping_cols(left, right, key="order_id"))
# ['status']
If this list is non-empty, choose one of the fixes below before calling pd.merge.
Knowing the colliding column names up front also tells you whether the overlap is intentional (both frames have genuinely different values for the same concept, like order status vs payment status) or accidental (both frames have a created_at audit column that carries identical values and should just be deduplicated). The fix differs for each case.
Fix 1 — Set Meaningful Suffixes
Pass suffixes= to replace the default _x/_y with labels that describe each source:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund" ], "notes": ["ok", "check"]})
result = pd.merge(
left, right,
on="order_id",
suffixes=("_order", "_payment"), # replaces _x / _y
)
print(result.columns.tolist())
# ['order_id', 'status_order', 'amount', 'status_payment', 'notes']
Use this when both columns carry different, useful information (e.g., order status vs. payment status). The suffix applies to every overlapping column in one call — if status and updated_at both collide, both get the suffix.
When you have many overlapping columns and want to see the full renamed list before running the merge:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "updated_at": ["2026-01-01", "2026-01-02"]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund"], "updated_at": ["2026-01-05", "2026-01-06"], "notes": ["ok", "check"]})
# Preview which columns would be renamed before committing
overlap = find_overlapping_cols(left, right, key="order_id")
print({col: (col + "_order", col + "_payment") for col in overlap})
# {'status': ('status_order', 'status_payment'), 'updated_at': ('updated_at_order', 'updated_at_payment')}
Fix 2 — Drop or Rename Before Merging
When one side's version is authoritative, drop the other before the merge so no suffix is needed:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund" ], "notes": ["ok", "check"]})
# Keep only the left-hand status; drop the right-hand duplicate before merging
result = pd.merge(
left,
right.drop(columns=["status"]), # remove the duplicate column
on="order_id",
)
print(result.columns.tolist())
# ['order_id', 'status', 'amount', 'notes']
Alternatively, rename the right-hand column to something meaningful before the merge:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund" ], "notes": ["ok", "check"]})
right_renamed = right.rename(columns={"status": "payment_status"})
result = pd.merge(left, right_renamed, on="order_id")
print(result.columns.tolist())
# ['order_id', 'status', 'amount', 'payment_status', 'notes']
Renaming is cleaner than dropping when you genuinely need both values downstream but want them to have unambiguous names from the start. If the right-hand frame comes from a third-party source you don't control, renaming at the point of load (inside your load_files function) keeps the merge call clean.
When normalizing column names for merging, the same re.sub lowercase pattern used in Cleaning Messy CSV Data with pandas works here too — run it before the merge, not after.
Fix 3 — Coalesce with combine_first
When both frames have the same column but with gaps (one has values where the other has NaN), merge first and then coalesce:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"id": [1, 2, 3], "score": [85.0, None, 90.0]})
right = pd.DataFrame({"id": [1, 2, 3], "score": [None, 72.0, None]})
result = pd.merge(left, right, on="id", suffixes=("_left", "_right"))
# Coalesce: take left value if available, fall back to right
result["score"] = result["score_left"].combine_first(result["score_right"])
result = result.drop(columns=["score_left", "score_right"]) # clean up
print(result)
# id score
# 0 1 85.0
# 1 2 72.0
# 2 3 90.0
combine_first fills NaN in the caller Series with values from the argument Series at matching index positions.
An equivalent using fillna:
# pip install pandas
import pandas as pd
result["score"] = result["score_left"].fillna(result["score_right"])
result = result.drop(columns=["score_left", "score_right"])
Both produce the same output. combine_first is slightly more idiomatic for "left-wins" coalescing; fillna is more explicit about the direction.
Variant — Merging on Differently-Named Keys
When the join key has different names in each frame, use left_on= / right_on=. pandas keeps both key columns in the output, which causes an apparent duplicate:
# pip install pandas
import pandas as pd
sales = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region": ["North", "South"], "target": [40000, 32000]})
result = pd.merge(sales, targets, left_on="sale_region", right_on="region")
print(result.columns.tolist())
# ['sale_region', 'revenue', 'region', 'target'] ← two region-like columns
Drop the redundant right-hand key after the merge:
# pip install pandas
import pandas as pd
sales = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region": ["North", "South"], "target": [40000, 32000]})
result = pd.merge(
sales, targets,
left_on="sale_region", right_on="region",
how="left",
validate="many_to_one", # raises MergeError if targets has duplicate region keys
)
result = result.drop(columns=["region"]) # remove the redundant right-hand key
print(result)
# sale_region revenue target
# 0 North 37000 40000
# 1 South 29400 32000
validate="many_to_one" raises pd.errors.MergeError immediately if the right-hand key is not unique, catching cardinality bugs before they silently inflate row counts.
Variant — MergeError: columns overlap but no suffix specified
This error fires when the suffixes tuple contains an empty string "" for a column that would otherwise collide:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"id": [1], "val": [10]})
right = pd.DataFrame({"id": [1], "val": [20]})
try:
pd.merge(left, right, on="id", suffixes=("", "")) # both empty → MergeError
except pd.errors.MergeError as exc:
print(exc)
# columns overlap but no suffix specified: {'val'}
Fix: never pass ("", "") for overlapping non-key columns. Either remove the duplicate column from one frame first, or provide distinct non-empty suffixes.
A related variant is passing suffixes=(None, "_right") — the None side means "keep the original name for the left frame's column". This works when only the right frame's version is the duplicate:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"id": [1, 2], "val": [10, 20], "notes": ["a", "b"]})
right = pd.DataFrame({"id": [1, 2], "val": [11, 22]})
result = pd.merge(left, right, on="id", suffixes=(None, "_right"))
print(result.columns.tolist())
# ['id', 'val', 'notes', 'val_right']
Use None for the side whose original column name should be preserved as-is. If both sides need renaming, provide two distinct strings.
Troubleshooting Table
| Symptom | Root cause | Fix |
|---|---|---|
col_x / col_y in output | Non-key column shared between both frames | Set suffixes= or drop/rename before merging |
MergeError: columns overlap but no suffix specified | suffixes=("", "") with overlapping column | Provide distinct non-empty suffixes or remove the overlap first |
| Row count unexpectedly multiplied | Right frame has duplicate key values, creating a many-to-many join | Add validate="many_to_one" or deduplicate the right frame's key |
Both key columns in output (sale_region + region) | Used left_on/right_on with different column names | Drop the redundant right-hand key with .drop(columns=[...]) after the merge |
KeyError on a column that looks present | Column name has hidden whitespace or different case after _x/_y rename | Print df.columns.tolist() to see exact names; normalize with str.strip().lower() |
Verification
After any of the fixes above, assert that no _x or _y columns remain:
# pip install pandas
import pandas as pd
def assert_no_suffix_columns(df: pd.DataFrame) -> None:
"""Raise if any column still has the default pandas merge suffixes."""
bad = [c for c in df.columns if c.endswith("_x") or c.endswith("_y")]
if bad:
raise AssertionError(f"Unexpected suffix columns after merge: {bad}")
print("OK — no _x/_y columns")
# assert_no_suffix_columns(result)
Also verify row count against expectations:
# pip install pandas
import pandas as pd
left = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region": ["North", "South"], "target": [40000, 32000]})
result = pd.merge(left, targets, left_on="sale_region", right_on="region", how="left")
result = result.drop(columns=["region"])
expected_rows = len(left) # for a left join, result should have same row count as left
assert len(result) == expected_rows, (
f"Row count changed: expected {expected_rows}, got {len(result)}. "
"Check for duplicate keys in the right frame."
)
For a full end-to-end pipeline that loads Excel files, merges, and exports, see Merging Multiple Spreadsheets. If you plan to convert the merged result to JSON, fix any _x/_y columns first — they produce ugly keys in the JSON output; see Converting Excel to JSON with Python.
Related
- Merging Multiple Spreadsheets — full workflow for concat, merge, and join across many files
- Reading Excel Files with Python — load individual workbooks before merging
- Cleaning Messy CSV Data with pandas — normalize column names before merging to prevent header-mismatch collisions
- Converting Excel to JSON with Python — export the clean merged table to JSON
Part of Merging Multiple Spreadsheets.