123
-=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- (c) WidthPadding Industries 1987 0|698|0 -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=-
Socoder -> Web Development -> MySQL, removing duplicates... but not blindly

Tue, 17 Aug 2010, 08:02
HoboBen
I have a table, tblReleases, which may look like this:



As you can see, multiple artists can create an album with the same title. However, sometimes an artist has a duplicate copy of their album, but with a different date.

I'm trying to select all records where an albumTitle is shared by the same artist, then choose the minimum releaseDate. I can't simply make (artistId, albumTitle) unique, because I need the minimum releaseDate.

Here's my current idea:

select albumTitle,count(albumTitle) as n from tblReleases group by albumTitle having n>1 limit 50;

This gives me:



So then, for each result, select the album by albumTitle, compare artistIds, and if any artistIds match delete the one(s) with the later date.

I think that this is probably a terrible way to go about it. Particularly, I'm sure there should be a way to improve my first query to consider artistIds. Hopefully even a query that can do it all in one go.

Can anyone offer a suggestion?

(I definitely need to buy a book on this subject)

Thanks!


|edit| I now have:

select * from tblReleases where txtTitle in (select txtTitle from tblReleases group by txtTitle having count(txtTitle)>1) limit 10;

but it's taking almost a second per row! |edit|



-=-=-
blog | work | code | more code
Tue, 17 Aug 2010, 09:17
JL235
One thing that might make it massively faster is if you push the album details out to it's own table. Something like:

tblReleases = { artistID, albumID, releaseDate }
tblAlbums = { albumID, title, other stuff ... }

That should make finding duplicate album names _much_ faster for this scenario, and should also lower your memory footprint (cos the name itself is only stored once).
Tue, 17 Aug 2010, 09:22
HoboBen
That's a good idea. Ideally the name *would* only be stored once, but this requires removing the duplicates first.

I think though, for the moment I'm just going to add unique(artistId, albumTitle). It's cheating, but will at least it will ignore any future duplicate albums inserted with later dates.

Thanks!

-=-=-
blog | work | code | more code
Tue, 17 Aug 2010, 09:29
JL235
... and the SQL you need should be something like:


Not sure if it's 'releaseDate DESC' or 'releaseDate ASC', so you might want to try flipping it if your getting the last rather then soonest.

|edit| I think I misread your original problem, I think the SQL above is what you truly want. It should (hopefully) print out all albums for a given artistID but only contain the minimum release date. |edit|
Tue, 17 Aug 2010, 10:58
waroffice
Coool didnt realise you could nest the select statements. thats ace.

PS: im a n00b when it comes to php/mysql so i have no idea how to solve your problem
Tue, 17 Aug 2010, 13:42
Stealth
This may be what you need. It's untested so it might not even work. The DISTINCT parameter selects only a distinct field. I've ordered by releaseDate so that the oldest one will be the distinctly selected one. Play around with this and let me know if you have any luck.



-=-=-
Quit posting and try Google.
Tue, 17 Aug 2010, 13:59
HoboBen
Thanks very much Stealth, I will give that a go!
Wed, 18 Aug 2010, 13:35
Stealth
Did this work for you?
Thu, 19 Aug 2010, 10:15
HoboBen
Not yet; I've been playing too much Borderlands! Should have time on the weekend though. Cheers

-=-=-
blog | work | code | more code