I need help finding the best program/method to clean and manage my database. I'm after simple instructions on the best software/method to solve the following problem:
I have 10 million+ entries in an Excel file (split across 10 sheets because of the row limit). I want to combine them in one program that is good for managing large CSV files, as I have outgrown Excel.
Each day I add roughly 500,000 entries. I then want to remove duplicates across the whole dataset, keeping only the first occurrence. For example, if a column in the initial 10 million entries contains the value "EXAMPLE1" and one or more of the 500,000 newly imported entries also contain "EXAMPLE1", I want to remove all of those newly imported duplicates while keeping the original entry that already existed. The same applies to any duplicates introduced within the new import itself, not just a single value.
Basically, the goal is to have one column that contains no duplicates, so that each day, as the database grows, I can ensure the added data doesn't match any of the existing data in that column.
I'm currently using Excel with Kutools to do this, and the operation takes 24+ hours. I'm looking for a better method.
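The dedup rule described above (keep the first occurrence, discard any later repeats of the key) can be sketched in plain Python with the stdlib `csv` module. The file names and the `id` key column here are hypothetical placeholders, not names from the original post:

```python
import csv

def append_unique(master_path, batch_path, key="id"):
    """Append rows from batch_path to master_path, skipping any row
    whose key value already exists (in the master or earlier in the batch)."""
    # Build the set of keys already present in the master file.
    with open(master_path, newline="") as f:
        seen = {row[key] for row in csv.DictReader(f)}
    # Keep only batch rows whose key has not been seen before;
    # adding each kept key to the set also drops duplicates
    # that occur within the batch itself.
    with open(batch_path, newline="") as f:
        reader = csv.DictReader(f)
        kept = []
        for row in reader:
            if row[key] not in seen:
                seen.add(row[key])
                kept.append(row)
        fieldnames = reader.fieldnames
    # Append the surviving rows, without repeating the header.
    with open(master_path, "a", newline="") as f:
        csv.DictWriter(f, fieldnames=fieldnames).writerows(kept)
```

This is single-pass per file and the key set for ~10 million short strings fits comfortably in memory, so a daily run should take minutes rather than 24+ hours.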
I'd like to take a crack at your problem. Could you share a sample of your data and the "id" column that makes your rows unique? Looking forward to it. Balaji
I would propose using a database such as MongoDB or MySQL. If you need full-text and fast searching, Elasticsearch might be a good choice too. Please open a quick chat to discuss :)
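To illustrate the database route this bid suggests, here is a minimal sketch using SQLite (bundled with Python, so no server setup). A UNIQUE index on the key column makes the database enforce the no-duplicates rule itself, and a daily import becomes a single `INSERT OR IGNORE`. The table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute("CREATE TABLE entries (id TEXT, val TEXT)")
# The UNIQUE index enforces "no duplicates in this column" at insert time.
conn.execute("CREATE UNIQUE INDEX idx_entries_id ON entries(id)")

# Initial bulk load of the existing data.
conn.executemany("INSERT OR IGNORE INTO entries VALUES (?, ?)",
                 [("EXAMPLE1", "a"), ("A", "b")])

# Daily batch: rows whose id already exists are silently skipped,
# and repeats within the batch lose to their first occurrence.
conn.executemany("INSERT OR IGNORE INTO entries VALUES (?, ?)",
                 [("EXAMPLE1", "x"), ("B", "y"), ("B", "z")])
conn.commit()
```

Because the index is consulted per insert, each daily 500,000-row import stays fast regardless of how large the table grows, which is the main advantage over re-scanning a spreadsheet.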