Completion Report
Project Summary:
SAC011 - Completion Report | ||||||||||||||||||||||||||||||||
Were the project goals met? Objectives: 1) Upgrade SITS Student Record to a supported version of SITS. 2) Upgrade the backend Oracle DB to a supported version (11g)- to 11.203 (incl HUB which is on 11.202, and STAR which is on 10g) 3) Minimise downtime for the LIVE system during upgrade 4) Allow sufficient time for testing to reduce chance of critical post-go live issues 5) Minimise overall time duration of the upgrade project - so that there is minimal disruption to other project work (ie. quick back-to-back deployments into DEV / TEST / LIVE) . Yes - DEV / TEST / LIVE & TRN upgraded to SITS v8.6.0 and Oracle upgrade to 11.203 (11g point release) Were the project deliverables fully or partially accomplished? Main Deliverables: 1) Refresh DEV, Upgrade STARDEV to 11g (11.203), HUBDEV to 11.203,EUCLID DEV to v8.6.0 (Dec 2012) 2) Complete peer test of DEV by Config team, and any required re-work, and confirm if downtime for HUB / EUGEX is acceptable 3) Upgrade STARTEST to 11g , EUCLID TEST to v8.6.0 (Jan 2013) 4) Complete integration test of TEST by SACS, and any required re-work 5) Upgrade STARLIVE to 11g, EUCLID LIVE to v8.6.0 6) Upgrade STARTRN to 11g,EUCLID TRN to v8.6.0 ( once any post go-live issues have been resolved, by Apr 2013)
If partial, what are the reasons for this? -Extended downtime for DEV as first upgrade failed -Scheduling of the refreshes proved difficult , as it proved difficult to get a reliable cold backup / extract of LIVE to be used as the source files for the refresh (Disk space issue - file sizes were larger and this caused a problem when copying files around for the refresh , additional disk (SAN) had to be attached to the DEV and TEST servers to faciliate the refresh process
Were there additional deliverables added? Additional deliverables: -DEV and TEST refreshed (was planned as a support task originally) with a current snapshot of data from LIVE (project was going to do the DEV refresh, and support were going to do the TEST refresh , in fact it made sense for one team to do both refreshes, and there was not resource in the support team to complete the refresh work ) -Issue with slow running HESA return process resolved by addition of new index Did the project deliver a solution to the problems being addressed? -After the upgrade of LIVE, some performance issues were identified with both eVision, retrievel of Uploaded documents and the refresh of views in EUGEX. These post-go live issues required more effort to investigate and fix. -New indexes were identified by Defeng Ma as required on MEN_DOC (via SAC011-97) , and adding these indexes improved the retrieval times for uploaded documents -To aid future performance tuning work , post live work was started as part of SAC011 project to deliver a set of automated / scripted e-Vision tests which can now be run before any upgrade to set a benchmark , then re-run again after upgrade for comparison purposes. They will need to be further developed/expanded to be used in future upgrade to test the performance in DEV , TEST , TRN and LIVE to pick up any potential performance issues earlier. Does the Project Sponsor agree that this project can be closed at this time? YES | ||||||||||||||||||||||||||||||||
Cost Summary | ||||||||||||||||||||||||||||||||
Project Manager's Commentary on Reasons For Variance From Plans For most projects the Project Brief should provide a reasonably accurate and detailed estimate of the staff and non staff costs for the project. Project Management research suggests that the actual resources used should typically be within 20% of this figure. The Project Manager should identify any events which required resource or schedule changes for the project. Additional effort to repeat upgrade of DEV (after first DEV upgrade failed due to missing step in implementation plan) Additional effort for performance investigation , after performance issues reported in LIVE The increase of the SACS effort is due to effort spent on Analysis for issues with Boiler Plates Text, Review of Regression scripts and EUGEX refresh issues | ||||||||||||||||||||||||||||||||
Key Learning Points | ||||||||||||||||||||||||||||||||
Consider in particular: 1. What went well? LIVE upgrade went to schedule out-of-hour over the weekend 2. What didn't go so well? 2a] Scheduling refreshes for DEV and TEST, problem getting reliable backup of LIVE (LIVE DB has grown massively ) 2b] DEV double upgrade , failed first time, client steps were skipped after the initial v8.5.1 upgrade - these client steps should be done after each upgrade - as SITS upgrades have to be applied sequentially, and and some post-upgrade client steps must be done before the next upgrade is started. 2c] Boiler plate text BPT (UoE customised boilerplate text) [see JIRA SAC011-19] "So do we just need to populate 'EUCLID_BPT' as the Language on all USR records? (EUCLID_BPT is the only language code (MEN_LAN), and is used on all LLP records) And that would then mean that it wouldn't matter if our PRA boilerplate were overwritten by Tribal, as the client would pull the boilerplate text from LLP instead of PRA" ACTION:IS Apps Mgmnt toupdate the client implementation steps for the SITS upgrade - to include this step for the new process for dealing with Boilerplate text: "Ensure that LTP option 1 and XPRA option 1 are being run correctly during the upgrade." Pride (Dev tech) has confirmed that this step is not in the existing SITS upgrade scripts , so Suran this task will need to fall to you or one of your colleagues in Apps mgmt when the next SITS upgrade is done, This new process will need to be checked in DEV as part of the next upgrade project - before being applied to TEST / LIVE / TRN, 2d] During 12-13 there were 2 projects delivering changes to SITS back to back - SAC013 to retire the EUCLID web servers and then SAC011 to deliver the SITS upgrade. These 2 projects should have been separated more clearly as they created some conflicts with resource bookings in the Dev tech team and for some issues like performance which took some time to manifest , it was then difficult to isolate the source of the issues (were they caused by SAC013 or SAC011) - issues with eVision health check 3. If you had a project like this again, what would you improve? Oracle Streams re-build after re-fresh and then after upgrade - need to check implementation steps, timing and add testing time for streams after re-build (Apps Mgmt - Suran- need 1 day for row counts on view, notes on timing of refresh jobs) . Warning - a lot of variation for individual jobs in EUGEX in DEV - need 1 week to compare average run times week before and week after EFin is fed from both STUDINT and EUGEX. The change has been that the standing data for EFin is now being fed from HUB rather than picking up a file from the web server. Check all Cron jobs enabled - need a list of jobs before and check that these jobs are enabled after, For LIVE - can we request a roll-out of software updates from Tribal - introduces risk if process different from DEV and TEST Issues around dealing with critical software updates - in principle there is a change freeze apart from critical software updates. For 8.6.0 some software updates broke the Asynchronous TRAN No holding page for referees Early identification of performance issues, and timing tests should be taken not just in e-Vision and client , but also to include the long-running EUGEX jobs in DEV and TEST, to highlight any changes after upgrade Prior to the upgrade , standard performance tests should be run to establish a benchmark, then the same tests should be repeated after the upgrade to quickly identifiy any performance issues. Defeng Ma has now got a set of automated tests / queries for SITS eVision, which can be re-used but will need to be further developped for the next upgrade. | ||||||||||||||||||||||||||||||||
Outstanding Issues | ||||||||||||||||||||||||||||||||
Identify any problems or work that will have to be done following project closure. -EUGEX course refresh - close monitoring to ensure that any performance changes over-time are picked up early Still split into 2 jobs: Student refresh - most of the time at the moment it is not under the hour Assessment - over the hour Course enrolment refresh - over 2 hours (If indexes removed course refresh can take 20 mins - but indexes needed for reporting) https://www.jira.is.ed.ac.uk/jira/browse/SAC011-96 still outstanding: further test to be done in TEST, before deciding to deploy it in LIVE. Agreed to do this as part fo Support so that project can close. -Load testing was not in scope for this project. Going forward, and given the start of Freshers week issue this year, an approach for load testing needs to be agreed and consideration on the best way to take it forward either as a separate or within an existing project. Once in place, it will provide greater confidence for times of peak load . Future high use projects would need to add to whatever framework is in place. -It can be a challenge to keep TRN at the correct release level for SITS config changes. It is used as the 'golden-copy' of SITS config before any change applied to LIVE, and the config is only applied to TRN once deployment to LIVE is confirmed as good (around 4 weeks after).
Other points to consider: Database parameter settings can get out of line between the different environments - part of the brief of the annual upgrade project should check and standardise database settings across the envs. During SAC011 , anomalies were picked up in TEST and TRN and were corrected as part of the project work
see SAC011-110 (changes applied 6-June 2013)
Changes to STARTRN to bring it in line with the configuration on STARTEST and below is a table summarising the changes made.
** As of 09/09/13 - Cursor-Sharing set to EXACT for all Envs to resolve performance problem on first Monday of Freshers Week - when EUCLID failed and was unavailable
Whilst doing the comparison Iit was noted that STARTEST was configured to use ASMM (Automatic Shared Memory Management) whereas all the other environments DEV, TRN and LIVE have this disabled. Mark Lang confirmed that STARTEST should also have ASMM disabled to bring it in line with other databases so Pride took the opportunity to disable ASMM on STARTEST.
Also the memory settings on STARTRN were on the low side so it was decided to add a bit more and below is a table summarising the STARTRN values before and after in comparison to STARTEST.
EUGEX issues and the Course refresh see full detail recorded in JIRA https://www.jira.is.ed.ac.uk/jira/browse/SAC011-96 Notes from Meeting with Mark Lang, Pride Shoniwa, Richard Good, Peter Pratt, Suran Perera 23/04/13 : Source of the EUGEX COurse refresh issue is likely to be an Oracle patch - when Oracle on HUBLIVE was upgraded from 11g to 11.203 A] Peter is following up some advice from the 11.203 patch release notes, re. Union All He has made a change in DEV and is now checking how long the Course refresh will take to complete in DEV B] Mark will set up an SQL profile for the Course refresh job in TEST. This can be done in parallel with the work being done by Peter in DEV C] if neither A] or B] improve performance, then next step would be to replace materialised view with a view or table. And also re-write the Course refresh job to improve performance. This would require extensive UAT testing by SACS - this could be done as a comparison of the output data before and after. This action C] is on hold , until results from A] and B] are known [Action C was subsequently partially done by Richard Good - who re-wrote some of the SQL to improve performance) Other issues discussed: The Course refresh used to take 45 mins to complete in LIVE (pre-upgrade) , it now takes 4 hours plus. The same performance is shown in DEV and TEST, though behaviour in DEV and TEST is not as consistent as LIVE Additional indexes were added to the Course refresh after the Triple upgrade project in March 2012 (STU191). Richard has seen some performance issues previously when unicode data is translated to non-unicode - this is the scenario for SITS / EUGEX, SITS is unicode, EUGEX is non-unicode Peter raised : issues around 'Cursor sharing ' and 'bind variables' not thought significant. Replacing hardware isn't on the horizon for some time - EUCLID and EUGEX have dedicated EUCLID hardware. Lessons Learned: 1) be careful with Oracle patches, need timing for EUGEX refresh jobs before and after each Oracle patching task 2) Best to take snapshots of Oracle 'explain plan' for specific vulnerable jobs like the the Course refresh before and after each upgrade Note of IS Apps EUGEX Review meeting - 29-May 2013: -Present: Iain Fiddes, Defeng Ma, Dave Berry, Franck Bergeret, Peter Pratt, Richard Good, Suran Perera, Pride Shoniwa, Geir Granum, Jamie Thin -Baseline : Problem first noted after the SAC011 SITS upgrade to LIVE on 4-March 2013 (see JIRA SAC011-96), when course refresh jumped from 45mins to 6hrs. This issue replicated in DEV, and timing in DEV for Course refresh was between 6hrs and 10hrs. Subsequent action has been to split the Assessment refresh from the Course refresh in LIVE - to get more timely assessment data for Exam data analysis and Exam boards, and to check the Oracle settings (as part of the upgrade of HUBLIVE to Oracle 11.203 - which was done as part of SAC011 project) Course refresh now running in 3hrs in LIVE (as at 29-May 2013 - improved from 6hrs to 3hrs after Oracle parrallelism setting changed) [Update 17-Sept-2013, after subsequent improvements , the Course Refresh is now taking approx 2 hrs to complete in LIVE - some variation in timing depending on time of day and other jobs which are running at that time] Iain Fiddes summarised outcome of recent investigation in DEV, the team have now got the Course refresh to complete in 15 mins in DEV by re-writing some SQL and by temp switching off (invalidating) indexes, then re-creating the indexes after the refresh. This indicates that maintaining indexes during the refresh is a costly process - however indexes can't be temp switched off in LIVE during the day when users are running queries against EUGEX [certain BOXI reports which run against EUGEX would fail with indexes switched off]
-Index issue: Query and update processes expensive with indexes switched on Indexes originally added for ESSMU (to feed BOXI) , Personal Tutors, original indexes will be documented in original projects eg. STU139 and Personal Tutor projects If indexes switched off during refresh , refresh job can complete in 15 mins in DEV, then can switch indexes back on after refresh - but impacts user service while indexes switched off
ACTION: promote Richard Good's SQL changes for the Course refresh from DEV to TEST to LIVE , need to compare outputs in TEST, and get SACS sign-off before this change is promoted to LIVE (DONE) Once the SQL changes are in LIVE, get new timings for Course refresh in LIVE, if timing still > 1 hour , further action will be required.
Iain - assessment refresh running multiple times in LIVE and overlapping with Course refresh, best to avoid overlap of refreshes where possible, otherwise it can result in locks
ACTION : review timings of scheduled jobs overnight currently these run hourly until midnight and re-start at 3am, Suran - advises this schedule could run until 11pm, then re-start at 7am (Suran will check with SACS and confirm any changes to existing scheduled jobs)
-Future tasks Further follow-up actions that could be done if required (and once budget approved - can't be funded from SAC011 )
ACTION: Complete review of EUGEX indexes (long-term action) - could split these out into 3 groups? - need to understand impact on business user operations Oracle optimiser behaviour has changed after upgrade to Oracle 11.203 Can Mark Lang measure which indexes are used? Which SQL queries are run most often?
Defeng - can Dev Tech enable log to see which indexes are used (Iain noted that this patter of usage changes during the year) Pattern of usage varies across the year , say in 3-monthly blocks..depending on business cycle : application / admission usage, registration / enrolment, assessment etc
Peter noted one problem is that historic data is increasing year on year, and the static historic data is also refreshed every hour (historic data is not always static)
Could a solution be to only refresh current year and 3 previous years .. .not all previous historic data? what new issues would be created by partially updating the data? STUDMI is already partitioned by year, solution could be found, but for some exceptional PGR students may need to refresh more years of data eg. up to 7 years if student interupts study?
-Future work that could improve EUGEX refresh times: Future SITS Data retention project for 13-14 (SAC017 but this has Applicant focus only, not Student data)
-Other Background notes: Original EUGEX project (STU139) was current data only, but was extended to historic data part-way through the project..this was late change to design, ESSMU .... what is range of data that can be queried via ESSMU Boxi universe ..is it current data only? Karen: 'EUGEX doesn’t include Historic data, it has the current data for a student, but this does mean that if a student left year ago, then that data is available, but if a student is now in their second year, you can not see the details of their first year. ESSMU reports off EUGEX so it has the current data. The users of ESSMU would like it to be historic.'
-Further work on indexes may require a new project to resource this effort Suran to check with SACS for advice as to which reports in ESSMU are the key reports / most frequently run reports which should be prioritised to be performant. Peter has noted Boxi users can take a standard report and then modify it to create their own local custom report, this flexibility will create a lot of diversity across the reports run. Karen: 'The whole idea is that BOXI is an end user query tool, so Peter’s comment is very pertinent. Not only can they change a standard report but they can create their own, so the emphasis can not solely be on Standard reports, although with the BOXI upgrade, standard reports may be the only reports available, so that may help in the process.'
| ||||||||||||||||||||||||||||||||
Review Contributors | ||||||||||||||||||||||||||||||||
List all contributors to the review, their role and relationship to the project and the date of their contribution.
Jamie Thin, Projet Manager Franck Bergeret, Programme Manager Defeng Ma, Technical Lead SSP IS Pride Shoniwa, Technical Architect Suran Perera, IS Apps Man Rachel Lacaille ,BA SSP SACS Karen Osterburg, Head BPE SSP SACS
|
Analysis of Resource Usage:
Staff Usage Estimate: 157 days
Staff Usage Actual: 180 days
Staff Usage Variance: 15%
Other Resource Estimate: days
Other Resource Actual: days
Other Resource Variance: 0%
Explanation for variance:
Key Learning Points:
Outstanding issues: