Fix pandas merge Overlapping Column Suffixes

After running pd.merge, columns that exist in both DataFrames under the same name — but are not listed in on= — appear as col_x and col_y in the output. If the suffix strings themselves collide you also get MergeError: columns overlap but no suffix specified. Neither outcome is what you want.

This page covers the root cause, how to diagnose which columns will collide before merging, and four concrete fixes: setting meaningful suffixes, dropping before merging, coalescing with combine_first, and using left_on/right_on for differently-named keys. For the broader merge workflow across many files, see Merging Multiple Spreadsheets.

Root Cause

pandas appends _x and _y to every non-key column that shares a name between the left and right DataFrames. Only the column(s) named in on= (or left_on=/right_on=) are treated as keys and deduplicated. Everything else gets suffixed to avoid silent data loss.

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"order_id": [1, 2], "status": ["open",  "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid",  "refund" ], "notes":  ["ok", "check"]})

result = pd.merge(left, right, on="order_id")
print(result.columns.tolist())
# ['order_id', 'status_x', 'amount', 'status_y', 'notes']

status appears in both frames. Because it is not listed in on=, pandas duplicates it with _x / _y instead of silently discarding one side.

The suffix logic is intentional: pandas has no way to know which version of status is correct, so it keeps both and lets you decide. The problem is that _x/_y are meaningless names. If both frames also share a second non-key column (say updated_at), you get updated_at_x and updated_at_y as well. The output quickly becomes unreadable.

The default suffixes are controlled by the suffixes parameter, which defaults to ("_x", "_y"). Every fix below either changes those suffixes to something meaningful or eliminates the collision before the merge runs.

Minimal Diagnostic

Run this before merging to see exactly which columns will collide:

# pip install pandas
import pandas as pd

def find_overlapping_cols(left: pd.DataFrame, right: pd.DataFrame, key: str | list) -> list[str]:
    """Return non-key columns present in both frames."""
    keys = {key} if isinstance(key, str) else set(key)
    left_cols  = set(left.columns)  - keys
    right_cols = set(right.columns) - keys
    return sorted(left_cols & right_cols)

print(find_overlapping_cols(left, right, key="order_id"))
# ['status']

If this list is non-empty, choose one of the fixes below before calling pd.merge.

Knowing the colliding column names up front also tells you whether the overlap is intentional (both frames have genuinely different values for the same concept, like order status vs payment status) or accidental (both frames have a created_at audit column that carries identical values and should just be deduplicated). The fix differs for each case.

Fix 1 — Set Meaningful Suffixes

Pass suffixes= to replace the default _x/_y with labels that describe each source:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"order_id": [1, 2], "status": ["open",  "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid",  "refund" ], "notes":  ["ok", "check"]})

result = pd.merge(
    left, right,
    on="order_id",
    suffixes=("_order", "_payment"),   # replaces _x / _y
)
print(result.columns.tolist())
# ['order_id', 'status_order', 'amount', 'status_payment', 'notes']

Use this when both columns carry different, useful information (e.g., order status vs. payment status). The suffix applies to every overlapping column in one call — if status and updated_at both collide, both get the suffix.

When you have many overlapping columns and want to see the full renamed list before running the merge:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"order_id": [1, 2], "status": ["open", "closed"], "updated_at": ["2026-01-01", "2026-01-02"]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid", "refund"], "updated_at": ["2026-01-05", "2026-01-06"], "notes": ["ok", "check"]})

# Preview which columns would be renamed before committing
overlap = find_overlapping_cols(left, right, key="order_id")
print({col: (col + "_order", col + "_payment") for col in overlap})
# {'status': ('status_order', 'status_payment'), 'updated_at': ('updated_at_order', 'updated_at_payment')}

Fix 2 — Drop or Rename Before Merging

When one side's version is authoritative, drop the other before the merge so no suffix is needed:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"order_id": [1, 2], "status": ["open",  "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid",  "refund" ], "notes":  ["ok", "check"]})

# Keep only the left-hand status; drop the right-hand duplicate before merging
result = pd.merge(
    left,
    right.drop(columns=["status"]),   # remove the duplicate column
    on="order_id",
)
print(result.columns.tolist())
# ['order_id', 'status', 'amount', 'notes']

Alternatively, rename the right-hand column to something meaningful before the merge:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"order_id": [1, 2], "status": ["open",  "closed"], "amount": [100, 200]})
right = pd.DataFrame({"order_id": [1, 2], "status": ["paid",  "refund" ], "notes":  ["ok", "check"]})

right_renamed = right.rename(columns={"status": "payment_status"})
result = pd.merge(left, right_renamed, on="order_id")
print(result.columns.tolist())
# ['order_id', 'status', 'amount', 'payment_status', 'notes']

Renaming is cleaner than dropping when you genuinely need both values downstream but want them to have unambiguous names from the start. If the right-hand frame comes from a third-party source you don't control, renaming at the point of load (inside your load_files function) keeps the merge call clean.

When normalizing column names for merging, the same re.sub lowercase pattern used in Cleaning Messy CSV Data with pandas works here too — run it before the merge, not after.

Fix 3 — Coalesce with combine_first

When both frames have the same column but with gaps (one has values where the other has NaN), merge first and then coalesce:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "score": [85.0, None,  90.0]})
right = pd.DataFrame({"id": [1, 2, 3], "score": [None, 72.0, None]})

result = pd.merge(left, right, on="id", suffixes=("_left", "_right"))

# Coalesce: take left value if available, fall back to right
result["score"] = result["score_left"].combine_first(result["score_right"])
result = result.drop(columns=["score_left", "score_right"])   # clean up
print(result)
#    id  score
# 0   1   85.0
# 1   2   72.0
# 2   3   90.0

combine_first fills NaN in the caller Series with values from the argument Series at matching index positions.

An equivalent using fillna:

# pip install pandas
import pandas as pd

result["score"] = result["score_left"].fillna(result["score_right"])
result = result.drop(columns=["score_left", "score_right"])

Both produce the same output. combine_first is slightly more idiomatic for "left-wins" coalescing; fillna is more explicit about the direction.

Variant — Merging on Differently-Named Keys

When the join key has different names in each frame, use left_on= / right_on=. pandas keeps both key columns in the output, which causes an apparent duplicate:

# pip install pandas
import pandas as pd

sales   = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region":      ["North", "South"], "target":  [40000, 32000]})

result = pd.merge(sales, targets, left_on="sale_region", right_on="region")
print(result.columns.tolist())
# ['sale_region', 'revenue', 'region', 'target']  ← two region-like columns

Drop the redundant right-hand key after the merge:

# pip install pandas
import pandas as pd

sales   = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region":      ["North", "South"], "target":  [40000, 32000]})

result = pd.merge(
    sales, targets,
    left_on="sale_region", right_on="region",
    how="left",
    validate="many_to_one",   # raises MergeError if targets has duplicate region keys
)
result = result.drop(columns=["region"])   # remove the redundant right-hand key
print(result)
#   sale_region  revenue  target
# 0       North    37000   40000
# 1       South    29400   32000

validate="many_to_one" raises pd.errors.MergeError immediately if the right-hand key is not unique, catching cardinality bugs before they silently inflate row counts.

Variant — MergeError: columns overlap but no suffix specified

This error fires when the suffixes tuple contains an empty string "" for a column that would otherwise collide:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"id": [1], "val": [10]})
right = pd.DataFrame({"id": [1], "val": [20]})

try:
    pd.merge(left, right, on="id", suffixes=("", ""))  # both empty → MergeError
except pd.errors.MergeError as exc:
    print(exc)
    # columns overlap but no suffix specified: {'val'}

Fix: never pass ("", "") for overlapping non-key columns. Either remove the duplicate column from one frame first, or provide distinct non-empty suffixes.

A related variant is passing suffixes=(None, "_right") — the None side means "keep the original name for the left frame's column". This works when only the right frame's version is the duplicate:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"id": [1, 2], "val": [10, 20], "notes": ["a", "b"]})
right = pd.DataFrame({"id": [1, 2], "val": [11, 22]})

result = pd.merge(left, right, on="id", suffixes=(None, "_right"))
print(result.columns.tolist())
# ['id', 'val', 'notes', 'val_right']

Use None for the side whose original column name should be preserved as-is. If both sides need renaming, provide two distinct strings.

Troubleshooting Table

Symptom	Root cause	Fix
`col_x` / `col_y` in output	Non-key column shared between both frames	Set `suffixes=` or drop/rename before merging
`MergeError: columns overlap but no suffix specified`	`suffixes=("", "")` with overlapping column	Provide distinct non-empty suffixes or remove the overlap first
Row count unexpectedly multiplied	Right frame has duplicate key values, creating a many-to-many join	Add `validate="many_to_one"` or deduplicate the right frame's key
Both key columns in output (`sale_region` + `region`)	Used `left_on`/`right_on` with different column names	Drop the redundant right-hand key with `.drop(columns=[...])` after the merge
`KeyError` on a column that looks present	Column name has hidden whitespace or different case after `_x`/`_y` rename	Print `df.columns.tolist()` to see exact names; normalize with `str.strip().lower()`

Verification

After any of the fixes above, assert that no _x or _y columns remain:

# pip install pandas
import pandas as pd

def assert_no_suffix_columns(df: pd.DataFrame) -> None:
    """Raise if any column still has the default pandas merge suffixes."""
    bad = [c for c in df.columns if c.endswith("_x") or c.endswith("_y")]
    if bad:
        raise AssertionError(f"Unexpected suffix columns after merge: {bad}")
    print("OK — no _x/_y columns")

# assert_no_suffix_columns(result)

Also verify row count against expectations:

# pip install pandas
import pandas as pd

left  = pd.DataFrame({"sale_region": ["North", "South"], "revenue": [37000, 29400]})
targets = pd.DataFrame({"region": ["North", "South"], "target": [40000, 32000]})

result = pd.merge(left, targets, left_on="sale_region", right_on="region", how="left")
result = result.drop(columns=["region"])

expected_rows = len(left)           # for a left join, result should have same row count as left
assert len(result) == expected_rows, (
    f"Row count changed: expected {expected_rows}, got {len(result)}. "
    "Check for duplicate keys in the right frame."
)

For a full end-to-end pipeline that loads Excel files, merges, and exports, see Merging Multiple Spreadsheets. If you plan to convert the merged result to JSON, fix any _x/_y columns first — they produce ugly keys in the JSON output; see Converting Excel to JSON with Python.

Merging Multiple Spreadsheets — full workflow for concat, merge, and join across many files
Reading Excel Files with Python — load individual workbooks before merging
Cleaning Messy CSV Data with pandas — normalize column names before merging to prevent header-mismatch collisions
Converting Excel to JSON with Python — export the clean merged table to JSON

Part of Merging Multiple Spreadsheets.