Python - Scrape PDF, website and insert into PostgreSQL database

Cancelled Posted 7 years ago Paid on delivery

I need a Python script that, when given a URL to a PDF file, will scrape data from the PDF and from [url removed, login to view] and insert the data into a PostgreSQL database. Familiarity with American football is recommended for this project.

Data source example:

[url removed, login to view]

[url removed, login to view]

[url removed, login to view]

The specific data that needs to be scraped from the PDF is Playtime Percentage, which is typically located on the last page(s) of the PDF. In addition to scraping the PDF, each player needs to be searched for on [url removed, login to view], and their unique GSIS ID needs to be scraped from their [url removed, login to view] player page. A unique game ID also needs to be obtained from the weekly game page on [url removed, login to view].

For example the GSIS ID for Cam Newton is: 00-0027939

As found in the HTML here: [url removed, login to view]
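As a minimal sketch of the extraction step, the ID could be pulled out of the page source with a regular expression. This assumes every GSIS ID has the same shape as the example above (two digits, a hyphen, seven digits); the function name and the sample HTML fragment are hypothetical, and the real markup around the ID may differ.

```python
import re
from typing import Optional

# Assumption: a GSIS ID always looks like NN-NNNNNNN, as in 00-0027939.
GSIS_ID_RE = re.compile(r"\b\d{2}-\d{7}\b")


def extract_gsis_id(html: str) -> Optional[str]:
    """Return the first GSIS-shaped token found in the page source, or None."""
    match = GSIS_ID_RE.search(html)
    return match.group(0) if match else None


# Hypothetical fragment; the actual player page markup is not shown here.
sample_html = "<!-- GSIS ID: 00-0027939 -->"
print(extract_gsis_id(sample_html))
```

If a page contains more than one token of that shape, the first match may not be the right one, so the surrounding markup should be checked before relying on this.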

Please be aware that some players have very similar names, and the PDF only gives a first initial and last name. When searching for a player to obtain their GSIS ID, you therefore need to ensure it belongs to the correct player. You can achieve this by searching [url removed, login to view] and verifying that the player's position matches the PDF and that their game logs on [url removed, login to view] show they played in the game described in the PDF. Game dates, opponents and other identifying information can all be found in the PDF. Also please be mindful that some PDF files fed into the script may be several years old, and players may have changed teams since then.
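The matching rules above could be sketched roughly as follows. The `Candidate` structure, the function names, and the way the PDF writes names ("C Newton" or "C.Newton") are all assumptions made for illustration; the candidates would come from whatever the player search returns. Team is deliberately not used as a filter, since players may have changed teams.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set, Tuple


@dataclass
class Candidate:
    """Hypothetical record built from one player search result."""
    gsis_id: str
    full_name: str
    position: str
    game_dates: Set[date] = field(default_factory=set)  # from the game logs


def initial_and_last(name: str) -> Tuple[str, str]:
    """Reduce 'Cam Newton', 'C Newton' or 'C.Newton' to ('c', 'newton')."""
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return ("", name.strip().casefold())
    return (parts[0][0].casefold(), " ".join(parts[1:]).casefold())


def resolve_player(pdf_name: str, pdf_position: str, game_date: date,
                   candidates: List[Candidate]) -> Optional[str]:
    """Return a GSIS ID only when exactly one candidate passes every check."""
    wanted = initial_and_last(pdf_name)
    matches = [
        c for c in candidates
        if initial_and_last(c.full_name) == wanted
        and c.position == pdf_position
        and game_date in c.game_dates
    ]
    return matches[0].gsis_id if len(matches) == 1 else None


# Illustrative data; the second player and the game date are made up.
candidates = [
    Candidate("00-0027939", "Cam Newton", "QB", {date(2015, 9, 13)}),
    Candidate("00-0000000", "Chris Newton", "WR", {date(2015, 9, 13)}),
]
print(resolve_player("C Newton", "QB", date(2015, 9, 13), candidates))
```

Returning `None` when zero or several candidates remain keeps ambiguous rows out of the database instead of guessing.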

To obtain the game ID you would extract information from the given PDF URL.

For example:

[url removed, login to view]

The above URL gives us the following information:

YEAR: 2015

TYPE: reg

WEEK: 01

With that information you would scrape data on the corresponding week’s NFL page.

The URL formation is:

[url removed, login to view](CAPS)WEEK(WITHOUT LEADING 0)

Which would result in:

[url removed, login to view]

On the week’s NFL page you would then obtain the game ID from the HTML. In the above example the game ID is: 2015091300
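A rough sketch of the URL handling described above. Because the actual URLs are not shown, the path layout matched by the regular expression and the `example.com` address are placeholders, and the suffix rule reads "(CAPS)" as the season type in upper case followed by the week number without its leading zero. All names here are hypothetical.

```python
import re
from typing import NamedTuple


class GameWeek(NamedTuple):
    year: int
    season_type: str
    week: int


# Assumed layout: .../<YEAR>/<TYPE>/<WEEK>/... ; adjust to the real PDF URLs.
PDF_URL_RE = re.compile(r"/(?P<year>\d{4})/(?P<type>[A-Za-z]+)/(?P<week>\d{1,2})/")


def parse_pdf_url(url: str) -> GameWeek:
    """Extract year, season type and week from a PDF URL."""
    match = PDF_URL_RE.search(url)
    if match is None:
        raise ValueError(f"Unrecognised PDF URL: {url}")
    # int() drops the leading zero from the week ("01" -> 1).
    return GameWeek(int(match["year"]), match["type"].lower(), int(match["week"]))


def week_page_suffix(game_week: GameWeek) -> str:
    """Upper-cased type followed by the week number, e.g. 'REG1'."""
    return f"{game_week.season_type.upper()}{game_week.week}"


# Placeholder URL, not the real data source.
info = parse_pdf_url("https://example.com/2015/reg/01/gamebook.pdf")
print(info, week_page_suffix(info))
```

The suffix would then be appended to the weekly page base address, which is left out here because it is not given.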

The database should be structured as follows:

game_id – This is the game id obtained from the NFL game page.

player_id – Player’s unique GSIS ID obtained from the player’s NFL profile page.

player_name – This is the 1st column of the Play Percentage page in the PDF.

position – This is the 2nd column of the Play Percentage page.

team – This is the team the player played for at the time of the game.

off_snaps – This is the 3rd column of the Play Percentage page (0 if blank).

off_pct – This is the 4th column of the Play Percentage page (0 if blank).

def_snaps – This is the 5th column of the Play Percentage page (0 if blank).

def_pct – This is the 6th column of the Play Percentage page (0 if blank).

spt_snaps – This is the 7th column of the Play Percentage page (0 if blank).

spt_pct – This is the 8th column of the Play Percentage page (0 if blank).
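One possible way to express the columns above as a table and an insert statement. The table name `play_percentage`, the column types, and the primary key are assumptions, not requirements from this brief. The `%s` placeholders follow the parameter style used by psycopg2, and `insert_rows` expects a DB-API cursor such as one obtained from a psycopg2 connection.

```python
from typing import Any, Dict, Iterable

# Assumed table name, types and key; only the column names come from the brief.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS play_percentage (
    game_id     TEXT    NOT NULL,
    player_id   TEXT    NOT NULL,
    player_name TEXT    NOT NULL,
    position    TEXT    NOT NULL,
    team        TEXT    NOT NULL,
    off_snaps   INTEGER NOT NULL DEFAULT 0,
    off_pct     NUMERIC NOT NULL DEFAULT 0,
    def_snaps   INTEGER NOT NULL DEFAULT 0,
    def_pct     NUMERIC NOT NULL DEFAULT 0,
    spt_snaps   INTEGER NOT NULL DEFAULT 0,
    spt_pct     NUMERIC NOT NULL DEFAULT 0,
    PRIMARY KEY (game_id, player_id)
)
"""

COLUMNS = (
    "game_id", "player_id", "player_name", "position", "team",
    "off_snaps", "off_pct", "def_snaps", "def_pct", "spt_snaps", "spt_pct",
)

INSERT_SQL = (
    "INSERT INTO play_percentage (" + ", ".join(COLUMNS) + ") "
    "VALUES (" + ", ".join(["%s"] * len(COLUMNS)) + ")"
)


def insert_rows(cursor: Any, rows: Iterable[Dict[str, Any]]) -> None:
    """Insert dict rows through a DB-API cursor, in COLUMNS order."""
    params = [tuple(row[col] for col in COLUMNS) for row in rows]
    cursor.executemany(INSERT_SQL, params)
```

Passing values as parameters rather than formatting them into the SQL string lets the driver handle quoting of player names.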

If the script encounters a PDF that doesn't have the requested stats, the script should return "Unavailable" and not insert anything into the database. Blank or empty cells in the PDF's table shall be replaced by 0.
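The two rules in this paragraph could be handled along these lines. It is only a sketch: it assumes the Play Percentage table has already been extracted into lists of eight cell strings, that percentages may carry a trailing "%", and that `insert` is a caller-supplied function that writes the rows. All function names are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence, Union


def to_int(cell: Optional[str]) -> int:
    """Blank or missing cell becomes 0."""
    text = (cell or "").strip()
    return int(text) if text else 0


def to_pct(cell: Optional[str]) -> float:
    """Blank or missing cell becomes 0; a trailing '%' is tolerated."""
    text = (cell or "").strip().rstrip("%")
    return float(text) if text else 0.0


def clean_row(cells: Sequence[Optional[str]]) -> Dict[str, Any]:
    """Map the eight PDF columns to named fields, padding short rows."""
    padded = list(cells) + [None] * (8 - len(cells))
    return {
        "player_name": (padded[0] or "").strip(),
        "position": (padded[1] or "").strip(),
        "off_snaps": to_int(padded[2]),
        "off_pct": to_pct(padded[3]),
        "def_snaps": to_int(padded[4]),
        "def_pct": to_pct(padded[5]),
        "spt_snaps": to_int(padded[6]),
        "spt_pct": to_pct(padded[7]),
    }


def process_rows(rows: List[Sequence[Optional[str]]],
                 insert: Callable[[List[Dict[str, Any]]], None]) -> Union[str, int]:
    """Return 'Unavailable' without inserting when the stats are missing."""
    if not rows:
        return "Unavailable"
    cleaned = [clean_row(row) for row in rows]
    insert(cleaned)
    return len(cleaned)
```

For example, `process_rows([], insert)` returns "Unavailable" and never calls `insert`, while a row such as `["C Newton", "QB", "70", "100%", "", "", "", ""]` is stored with zeros in the four blank columns.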

If you have any questions or need additional explanation or examples please don’t hesitate to ask.

PDF PostgreSQL Python Web Scraping

Project ID: #11548845

About the project

1 proposal Remote project Active 7 years ago

1 freelancer is bidding on average $250 for this job

mantislin

Hi sir, I am a scraping expert and have done many similar projects; please check my feedback to see for yourself. Can you tell me more details? Then I will provide demo data for you. Thanks, Kimi

$250 USD in 5 days
(118 Reviews)
6.7