We want to reconstruct a catalogue of CD and record releases that we had but lost. This information was held on separate HTML pages. We call these Release pages, as each one holds the information on a single CD release. However, our website and its data were deleted and lost.
Luckily, most of the website still exists on the web archive site [url removed, login to view] (this stores snapshots of websites over time and archives them).
The website we are scraping is [url removed, login to view]://[url removed, login to view]
Here are some examples of the Release pages:
[url removed, login to view]://[url removed, login to view] (an example without the track list)
[url removed, login to view]://[url removed, login to view] (an example with the track list)
The "2005" (etc.) part of the URL appears to be an address and rewrites itself.
We are looking for a program which serially visits the Release pages in sequence, scrapes the information from the various sections/headings/fields, and then puts the data into a simple Excel spreadsheet. The data scraped from each Release page will go into one row of the XLS.
As for the technology/language for this, we are not really fussed, as long as it can run on Windows XP without having to install anything, so VB in Excel or Java is fine.
It needs a simple interface with two options: 1) a range of numbers from X to Y, OR 2) a specific URL.
The logic is as follows:
1) Pop up an Input Box asking the user to input the “Lower Release Page Number?” with a Go Button, when they press Go store this as X, then…
2) Pop up an Input Box asking the user to input the “Upper Release Page Number?” with a Go Button, when they press Go store this as Y, then…
3) Pop up an Input Box asking the user to input the “Interval? (Usually 1)” with a Go Button, when they press Go store this as Z, then…
4) Create an XLS (or delimited file, if not using Excel) called C:\[url removed, login to view] (or .txt), where AAAAAA = the current time with seconds, e.g. 165901
5) Set N = X
6) FIND LOOP: While N <= Y Do
7) Import the page "[url removed, login to view]://[url removed, login to view] "
8) If that page doesn't open and it is a dead URL, then print N + " Not a Valid Page" in the error log window, then a new line, and skip to step 13 (the increment)
9) If the page does open, then load it and parse the page to copy the values from the fields as specified in the field table below. Search for keywords and use string functions. Remove leading and trailing spaces from the scraped data. Normalise any ALL-CAPPED words, e.g. MAKEBA, MYRIAM = Makeba, Myriam.
10) Create a new row in the XLS and write out the values as detailed in the field table below
11) If the image for that album is not the default placeholder (called [url removed, login to view]), then save the image as c:\scrape\[url removed, login to view], where BBBB = the cat number just scraped
12) Print N + " Successfully scraped" in the error log window, then a new line
13) N = N + Z (when the interval Z is 1 this simply moves to the next page; a larger Z skips pages)
14) End While
15) When finished, print "*********DONE**********" in the error log window
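The loop above can be sketched in Java roughly as follows. This is only an illustration: the real base URL was removed from this brief, so BASE_URL below is a placeholder, and the parsing/row-writing step is left as a stub. The normalizeCaps helper shows one way to handle the ALL-CAPPED words mentioned in step 9.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class ScrapeLoop {

    // Placeholder: the real release-page address was removed from this brief.
    static final String BASE_URL = "http://example.com/release/";

    // Normalise ALL-CAPPED words, e.g. "MAKEBA, MYRIAM" -> "Makeba, Myriam"
    static String normalizeCaps(String s) {
        StringBuilder out = new StringBuilder();
        for (String word : s.trim().split(" ")) {
            if (out.length() > 0) out.append(' ');
            if (word.length() > 1 && word.equals(word.toUpperCase())) {
                out.append(word.charAt(0)).append(word.substring(1).toLowerCase());
            } else {
                out.append(word); // mixed-case words are left alone
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        int x = 100, y = 110, z = 1; // X, Y and interval Z from the three input boxes
        for (int n = x; n <= y; n += z) {
            try {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(BASE_URL + n).openConnection();
                if (conn.getResponseCode() != 200) {   // dead URL
                    System.out.println(n + " Not a Valid Page");
                    continue;                          // skip straight to the increment
                }
                // ... parse the page, normalise the fields, write one row ...
                System.out.println(n + " Successfully scraped");
            } catch (IOException e) {
                System.out.println(n + " Not a Valid Page");
            }
        }
        System.out.println("*********DONE**********");
    }
}
```

A real bid would replace the stubbed comment with the field parsing from the table below and write each result as a spreadsheet row.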
In the XLS / Tab Sheet
Column A = Artist = Scrape from the top section (e.g. MAKEBA, MYRIAM)
Column B = Release = Scrape from the top section (e.g. The Click Song)
Column C = Label = Scrape from "Label:" section (e.g. SONO)
Column D = Cat Number = Scrape from "Code article:" section (e.g. D5564)
Column E = Format = Scrape from "Support:" section (e.g. CD)
Column F = Family = Scrape from "Famille:" section (e.g. Musiques du monde)
Column G = Date = Scrape from "Date de sortie:" section (e.g. 04 septembre 1998)
Column H = Genre = Scrape from "Genre:" section (e.g. Afrique Du Sud)
Column I = Barcode = Scrape from "Code barre:" section (e.g. 252418556458)
Column J = Tariff = Scrape from "Tarif:" section (e.g. 834)
Columns K, L, M, etc. contain the optional track names from the white box, e.g. K = "House of the Rising Sun", L = "Iya guduza"
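One way to assemble a row from the field table above, if a delimited file is used instead of a native XLS: fixed columns A–J in order, then however many track names were found. The buildRow helper itself is hypothetical, not part of the brief; it also trims each cell, per step 9.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReleaseRow {

    // Columns A-J in the order given above, then any track names as K, L, M, ...
    static String buildRow(String artist, String release, String label, String catNumber,
                           String format, String family, String date, String genre,
                           String barcode, String tariff, List<String> tracks) {
        List<String> cells = new ArrayList<>(Arrays.asList(
                artist, release, label, catNumber, format,
                family, date, genre, barcode, tariff));
        cells.addAll(tracks);                            // optional track columns
        List<String> trimmed = new ArrayList<>();
        for (String cell : cells) trimmed.add(cell.trim()); // strip leading/trailing spaces
        return String.join("\t", trimmed);               // one tab-delimited row
    }
}
```

With a tab-delimited .txt file the row count per release stays one line regardless of how many tracks exist, which matches the one-row-per-page requirement.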
I estimate this at around 100 lines of code.
If you are interested in bidding, then please include the following:
- Your location & time zone
- Fluency in English
- Your payment terms
- Your usual availability on Skype/MSN (hours per day)