Every project starts the same way:
- Client sends a messy CSV file
- I write a quick script to clean it
- A week later… they send another file, slightly different
- I tweak the script again
- Repeat until I’m buried in tiny, fragile one-off scripts
Sound familiar?
In the past, I treated CSV cleaning like it was a minor task—just whip up some Node.js, make the necessary fixes, and then get on with my day.
The Problem With One-Off Scripts
One-off scripts are fast to write and easy to forget. But they come back to haunt you when:
- A client changes the column order or headers
- You forget which script handles which format
- Someone else needs to run it—and it only works on your machine
- You end up repeating the same logic across 10 files
I was solving the same problems repeatedly:
- Normalize inconsistent column names
- Convert date formats
- Drop blank or duplicate rows
- Handle different encodings (UTF-8 with BOMs… hello darkness)
- Export the cleaned result
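Just to make that concrete, here's roughly the helper I kept rewriting for the first couple of bullets. The name normalizeHeaders and the exact replacements are my own sketch, not from any library:

// normalize-headers.js (illustrative sketch)
// Strips a stray UTF-8 BOM and normalizes header keys like "First Name " / "first_name"
// down to one canonical spelling, so downstream code can rely on a single form.
function normalizeHeaders(rows) {
  return rows.map(row => {
    const cleaned = {};
    for (const [key, value] of Object.entries(row)) {
      const normalizedKey = key
        .replace(/^\uFEFF/, "") // drop a leading BOM if the parser let it through
        .trim()
        .toLowerCase()
        .replace(/\s+/g, "_"); // "First Name" -> "first_name"
      cleaned[normalizedKey] = value;
    }
    return cleaned;
  });
}

module.exports = { normalizeHeaders };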
I didn’t need more scripts. I needed structure.
What I Do Now Instead
These days, when a messy new file lands in my inbox, I don’t start from scratch.
I’ve settled on a lightweight approach that breaks the work into small, testable parts:
- input parsers (CSV, Excel, JSON)
- a normalization layer (headers, encodings)
- a transformation layer (date formatting, filters, maps)
- an output formatter (CSV, JSON, preview)
This isn’t a framework. It’s just a mindset:
Write it once → reuse it forever.
Example: Simple Modular Cleanup in Node.js
Instead of one giant script, I use small utilities like these:
parser.js

const fs = require("fs");
const csv = require("csv-parser");

function parseCSV(filePath) {
  return new Promise((resolve, reject) => {
    const results = [];
    fs.createReadStream(filePath)
      .pipe(csv())
      .on("data", (row) => results.push(row))
      .on("end", () => resolve(results))
      .on("error", reject);
  });
}

module.exports = { parseCSV };
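Excel and JSON inputs get their own parser modules next to this one. A JSON version might look like the sketch below (the filename and function name are hypothetical); the point is that it resolves to the same array-of-rows shape, so everything downstream stays identical:

// parser-json.js (hypothetical sibling to parser.js, same output shape)
const { readFile } = require("fs/promises");

// Resolves to an array of plain row objects, just like parseCSV does.
async function parseJSON(filePath) {
  const raw = await readFile(filePath, "utf8");
  const data = JSON.parse(raw);
  return Array.isArray(data) ? data : [data];
}

module.exports = { parseJSON };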
cleaner.js

function cleanRows(data) {
  return data
    .filter(row => Object.values(row).some(val => val !== "")) // drop blank rows
    .map(row => {
      const parsedDate = new Date(row.date);
      return {
        ...row,
        // Normalize date to YYYY-MM-DD; leave it untouched if it doesn't parse
        date: isNaN(parsedDate) ? row.date : parsedDate.toISOString().split("T")[0],
        name: row.name?.trim(), // Clean string
      };
    });
}

module.exports = { cleanRows };
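Note that cleanRows only drops blank rows. When a client file also contains duplicates, I add a separate pass like this (a sketch of my own, not part of the snippets above; it keys each row on its JSON string, which assumes csv-parser gives every row the same column order):

// dedupe.js (illustrative sketch)
// Removes exact duplicate rows by serializing each row to a string key.
function dedupeRows(data) {
  const seen = new Set();
  return data.filter(row => {
    const key = JSON.stringify(row);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

module.exports = { dedupeRows };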
exporter.js

const { writeFileSync } = require("fs");

function exportCSV(data, path) {
  const header = Object.keys(data[0]).join(",");
  const rows = data.map(obj => Object.values(obj).join(",")).join("\n");
  writeFileSync(path, `${header}\n${rows}`, "utf8");
}

module.exports = { exportCSV };
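One caveat on exportCSV: joining values with commas breaks as soon as a field itself contains a comma, a quote, or a newline. When that matters, I swap in a tiny quoting step. This is a sketch following the usual CSV quoting convention (RFC 4180 style), not any particular library:

// Wraps a value in quotes and doubles embedded quotes when needed.
function quoteField(value) {
  const str = String(value ?? "");
  return /[",\n]/.test(str) ? `"${str.replace(/"/g, '""')}"` : str;
}

// Drop-in replacement for the row-building line in exportCSV:
// const rows = data.map(obj => Object.values(obj).map(quoteField).join(",")).join("\n");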
main.js

const { parseCSV } = require("./parser");
const { cleanRows } = require("./cleaner");
const { exportCSV } = require("./exporter");

async function runCleanup() {
  const raw = await parseCSV("dirty.csv");
  const cleaned = cleanRows(raw);
  exportCSV(cleaned, "cleaned.csv");
}

runCleanup();
Now, whenever I receive a new file, I simply adjust my cleaner.js logic—no need to start from square one anymore.
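In practice, “adjusting the cleaner logic” usually means passing a small per-client config rather than editing the function body. A configurable sketch (the option names here are just my own convention, not from the code above):

// cleaner.js, configurable variant (sketch)
function cleanRows(data, { dateColumn = "date", trimColumns = ["name"] } = {}) {
  return data
    .filter(row => Object.values(row).some(val => val !== ""))
    .map(row => {
      const out = { ...row };
      if (out[dateColumn]) {
        const parsed = new Date(out[dateColumn]);
        if (!isNaN(parsed)) out[dateColumn] = parsed.toISOString().split("T")[0];
      }
      for (const col of trimColumns) {
        if (typeof out[col] === "string") out[col] = out[col].trim();
      }
      return out;
    });
}

// main.js then only changes the options per client:
// const cleaned = cleanRows(raw, { dateColumn: "order_date", trimColumns: ["name", "email"] });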
Benefits of Moving Away From “Just Scripts”
- Less copy-paste, more confidence
- Easier to onboard clients or teammates
- Faster debugging (you know where the logic lives)
- Fewer edge-case surprises
- Scales from a 100-row file to 1 million+ rows
Now when I get a weird file with 12 columns, 3 date formats, and 2 “LOL” rows… I know my workflow can handle it.
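One honest caveat on the “million-plus rows” point: parseCSV above buffers every row in memory. Once files get that big, I switch the same pieces over to a streaming pass, roughly like the sketch below. It reuses the same csv-parser dependency; cleanRow is assumed to be a per-row version of the cleanRows logic:

const fs = require("fs");
const csv = require("csv-parser");

// Streams rows from input to output without holding the whole file in memory.
function streamCleanup(inputPath, outputPath, cleanRow) {
  return new Promise((resolve, reject) => {
    const out = fs.createWriteStream(outputPath, "utf8");
    let headerWritten = false;

    fs.createReadStream(inputPath)
      .pipe(csv())
      .on("data", (row) => {
        const cleaned = cleanRow(row);
        if (!headerWritten) {
          out.write(Object.keys(cleaned).join(",") + "\n");
          headerWritten = true;
        }
        out.write(Object.values(cleaned).join(",") + "\n");
      })
      .on("end", () => {
        out.end();
        resolve();
      })
      .on("error", reject);
  });
}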
Takeaways for Devs Handling Messy Data
- Your first script should solve the problem
- Your second should solve the pattern
- Your third should become a system
If you’re still writing one-off scripts for every client file: no shame — we’ve all done it. But long term, it’s pain on repeat.

If you’ve already moved to a modular, testable data-cleaning setup, I’d love to hear how you approached it.